User guide · Read me first v4 · Marvin

Marvin — How it works

Marvin is a software credit committee. Eight specialised agents — five working live cases, two keeping the bank’s policy library current, and one (Marvin himself) watching the other seven from above. Every decision is regulator-explainable; every approval is grounded in a citation; every system slowdown gets caught before it ships a bad call. This document walks you through how to use the system and what happens behind the scenes when you do.

Audience: Operators, analysts, investors · Read time: ~25 minutes · Status: v4 · production-ready

1 · Welcome — what this guide covers

This document is a guided tour of Marvin. Read it top to bottom on your first visit and you’ll know how to submit a case, follow a debate, answer the system’s questions, read a verdict, and verify the system on your own historical data. Come back to specific sections when you need them.

We’ll keep this practical. Every concept gets a concrete example, every screen has a pointer to where you find it, and every technical detail is explained in plain language first.

Three things to know up front: Marvin doesn’t replace human judgement — it produces an explainable recommendation and routes the genuinely uncertain cases to a human. Every approval comes with a citation that points back to the exact policy clause it relied on. And the system watches itself: Marvin (the agent the product is named after) flags drift, hallucinations, and bottlenecks before they touch real applicants.

2 · What Marvin is, in one minute

Marvin is a credit-decision platform built like a real bank credit committee — except every member is a specialised AI agent, every disagreement is resolved by a piece of code that an auditor can read line by line, and every call leaves a paper trail of citations back to source policy.

The shape of the system

The eight agents are split across three departments:

Department A · Live execution

The five front-office agents that work every credit case in real time. Compliance, Underwriter, Quant, Orchestrator, Archivist. They run synchronously per case and finish in about a minute.

Department B · Strategy & development

Two background agents that keep the bank’s policy library current. Scout reads the open web for new regulations; Architect drafts policy revisions for the operator to approve.

Department C · Evolution · the god view

One agent — Marvin — sits above the other seven and watches their conversations, traces, citations, and clarification logs. He never talks to applicants and never writes policy. He surfaces drift before it becomes a bad decision. The product is named after him because this watcher-of-watchers role is the thing that makes the rest of the system safe to run unattended.

Why a committee, and not one big model?

A real credit committee works because each member has narrow expertise and clear authority. A compliance officer can hard-stop a deal a salesperson loves. A quant can produce numbers without caring whether the deal goes through. The chair listens to all three and follows a written hierarchy when they disagree. We ported that structure into software because it’s the structure auditors and regulators already understand.

A monolithic LLM, no matter how good, has to be one entity holding contradictory roles in its head. Marvin’s agents are specialised by design — the Compliance Guardian can’t accidentally soften a sanctions block to chase a sale, because the Compliance Guardian’s prompt and document allowlist physically don’t reach the policy clauses that an Underwriter would use to argue.

3 · Meet the eight agents

Each agent has a written persona, a fixed allowlist of documents it can read, and a narrow output contract. You’ll see them on the live committee view (Department A) and in the admin tabs (Departments B and C). Click any agent in the app to see its current system prompt under Settings; the operator can edit a prompt without a redeploy, and the change is visible to the next debate.

Department A · the live committee · 5 agents

📚 Internal Archivist

RAG provider · the librarian

“You are the keeper of the keys. You don’t form opinions. You hand the right document to the agent that asked.”

Reads: the entire corpus on behalf of others. Speaks: never — the Archivist is invisible to the operator. Behind the scenes, every other Department-A agent calls the Archivist to retrieve passages from its own slice of the corpus.

🛡 Compliance Guardian

AML / KYC / sanctions / PEP · hard-stop authority

“Paranoid prosecutor. Cites article, not opinion. When I say block, the deal stops — no override.”

Reads: the AML/KYC instruction. Verdicts: pass, review-required, hard-reject. A hard-reject ends the pipeline immediately; the other agents don’t even run.

🤝 Underwriter

Deal structurer · finds compensating factors

“Sales-leaning optimist. I argue the applicant’s case to the chair. I don’t fear bad numbers; I look for legal ways to work around them.”

Reads: credit policy, risk-appetite exclusion list, application-package requirements, self-employed annexes. Output: a recommendation (APPROVE / REJECT / REVIEW), a list of compensating factors, and an application-completeness check.

🧮 Data Quant

Numbers · DSTI / DSCR / credit score / default probability

“Cold mathematician. I see the world in ratios. Sentimentality is noise. I produce numbers and flag whether they fit the methodology.”

Reads: scoring methodology, refinancing/consolidation rules, self-employed annex § 3 (DSCR formula). Verdicts: within-policy, breach-soft, breach-hard.

⚖ Orchestrator

Chair · applies the conflict tree, writes the verdict

“Wise judge. I don’t compute formulas or read identity documents. I weigh the three dossiers and explain the call in language a regulator can read.”

Reads: the credit-committee precedents log + strategic-bonus / compensating-factor sections of the relevant annexes. Output: the final decision narrative.

Department B · strategy & development · 2 agents

📡 External Scout

Web + RSS monitor

“Hyperactive digital reporter. I rove the open web for changes the bank should care about — new regulations, EBA guidelines, macro shifts. I don’t analyse deeply; I bring back the link.”

Reads: nothing internal — the open web only. Every finding lands in the Newsroom tab with a relevance score, and you can dismiss, flag, or hand it to the Architect.

🏗 Policy Architect

Drafts internal-policy revisions

“Proactive bureaucrat-visionary. I love matrices. Give me a Scout finding or an orchestrator gap, and I’ll draft the policy update for you to approve. Clean Markdown, ready to ingest.”

Reads: existing internal policies (for reference) and Scout’s findings. Output: a draft in the Policy Lab tab with title, body, target tier, and proposed bundle label. The Architect can never write Tier 1 (regulator) drafts — that’s reserved for the regulator’s actual text.

Department C · evolution · 1 agent

🤖 Marvin (Meta-Architect)

AI auditor · the system watching itself

“I don’t care about banking. I care about the agent layer. I read the other seven’s logs, traces, citation lookups, eval runs. When something drifts, I write you a recommendation.”

Reads: system logs only — agent_outputs, decision_version_manifest, eval_runs, clarification_events. Output: findings in the System Health tab, sorted by severity. Marvin recommends — he never deploys anything himself.

Personas are editable: each agent’s system prompt is a row in runtime_settings and you can edit it from the admin Settings tab. The default personas above are good for most banks; if your house style differs, change the prompt and the next debate uses the new wording.

4 · A walkthrough — one case from submit to verdict

Let’s follow a single application end to end. Imagine Maria, a self-employed consultant, applying for a €30,000 loan to consolidate two existing consumer loans into a single facility. Her credit score is 620 (borderline), her income is €38,000/year, and she has one missed payment from 18 months ago.

Step 1 — Submit a case

Click Cases in the left sidebar to open the inbox. Pick a preset that looks similar to Maria’s profile (or click + New case to start from scratch), fill the borrower fields, and click Run decision at the top right.

You’re sent to the analyst page where the live committee starts streaming dossiers. The URL is /analyst/decision/<sessionId> — bookmark it; that’s your audit-trail link for this case.

Step 2 — Compliance runs first, alone

The Compliance Guardian opens the case. He runs sanctions and PEP screening, retrieves relevant AML clauses, and produces a verdict. Three outcomes are possible: pass, review-required (a human compliance officer must sign off), or hard-reject (the pipeline stops immediately).

For Maria the screening comes back clean. Compliance returns pass.

Step 3 — Underwriter and Quant work in parallel

Now two agents run side by side to keep wall-clock time under a minute. The Underwriter reads policy passages, looks at Maria’s package, and produces a dossier with a recommendation, a list of compensating factors, and a check on the application package’s completeness. Compensating factors live in a fixed enumeration — additional_collateral, co_signer, income_history, liquidity_buffer, tenure, low_dti, down_payment, relationship_value, regulatory_exception — so the Orchestrator can branch on them later without surprises.

The Quant computes Maria’s DSTI (debt-service-to-income), and because she’s self-employed, also her DSCR (debt-service coverage ratio). It pulls credit-bureau and social-security data through stubbed clients (real APIs in production), checks her score bucket, computes default probability, and emits a verdict: within-policy, breach-soft (within 10% of a threshold), or breach-hard.

For Maria, the Quant reports a breach-soft on DSTI — she’s 4 points over the 35% ceiling for self-employed applicants under 680 score — but the Underwriter has flagged a strong compensating factor: her consolidation actually reduces her monthly burden by 18%.
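
To make the thresholds concrete, here is a minimal TypeScript sketch of a DSTI classification of the kind the Quant performs, assuming the dossier carries a ratio plus a ceiling. The field names, the example ceiling, and the exact width and definition of the soft band are illustrative; the real values live in the Tier 3 scoring methodology.

```typescript
type QuantVerdict = "within-policy" | "breach-soft" | "breach-hard";

interface DstiInputs {
  monthlyDebtService: number; // total debt payments after the new loan, €
  monthlyNetIncome: number;   // declared net income, €
  dstiCeiling: number;        // e.g. 0.35 for self-employed applicants below score 680
  softBand: number;           // how far past the ceiling still counts as "soft", e.g. 0.10
}

// DSTI = debt service / income. A breach inside the soft band is breach-soft;
// anything beyond it is breach-hard. DSCR for self-employed applicants would
// follow the same pattern with its own formula and floor.
function classifyDsti({ monthlyDebtService, monthlyNetIncome, dstiCeiling, softBand }: DstiInputs): {
  dsti: number;
  verdict: QuantVerdict;
} {
  const dsti = monthlyDebtService / monthlyNetIncome;
  if (dsti <= dstiCeiling) return { dsti, verdict: "within-policy" };
  if (dsti <= dstiCeiling * (1 + softBand)) return { dsti, verdict: "breach-soft" };
  return { dsti, verdict: "breach-hard" };
}
```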

Step 4 — The conflict tree picks a branch

The three dossiers go into a deterministic TypeScript function called the conflict tree. It walks down a five-level hierarchy and returns one of eleven branches (we’ll cover the full tree in §7). For Maria, the branch is quant-breach-soft-with-comp: a soft Quant breach overcome by a strong compensating factor. The decision is APPROVE with conditions.

Step 5 — The Orchestrator writes the verdict

The Orchestrator runs last. Its job is not to make the decision — the conflict tree already did that. Its job is to write the regulator-explainable narrative that names each agent’s contribution, references the precedent log, and states the conditions of approval. For Maria the verdict ends with something like:

Final decision · APPROVE with conditions Compliance returned a clean pass on sanctions, PEP, and prohibited-sector checks. The Quant flagged a soft DSTI breach (39%, ceiling 35% for self-employed below score 680), but the Underwriter cited a strong compensating factor: post-consolidation debt service drops from 41% to 33% of declared income, satisfying the mitigation logic in Annex 03-C § 3. Approval is conditional on the new loan being disbursed directly to the two existing creditors, per the consolidation rule in Annex 03-C § 4.

What the operator sees

The analyst page is split into three columns:

The whole thing typically completes in 30 to 60 seconds per case.

A small cancel chip appears next to the “thinking…” indicator on whichever agent is currently running — co-located with the activity, not buried in a page top bar. Clicking it stops the orchestrator at the next safe checkpoint; the partial dossiers stay attached to the session for audit. The chip moves automatically as the pipeline advances and disappears once the verdict lands.

[Submit case]
      ↓
Compliance Guardian           ← runs first; sanctions + PEP + RAG
      ↓
  hard-reject? ─────────────→ short-circuit: REJECT, end
      ↓ no
  ┌───┴──────────┐
  ↓              ↓
Underwriter   Data Quant      ← run in parallel
  │              │
  └───┬──────────┘
      ↓
Conflict tree                 ← deterministic TypeScript; picks 1 of 11 branches
      ↓
Orchestrator agent            ← LLM writes the verdict
      ↓
Final decision                ← APPROVE / REJECT / REVIEW / REWORK

5 · The four outcomes — what each one means

Every case ends on one of four outcomes. Each maps to a different operator workflow.

Outcome | What it means | Operator action
APPROVE | The committee agreed to lend, with or without conditions. Conditions are listed verbatim on the verdict card. | Forward to disbursal. If conditions are listed, ensure they’re met before funds release.
REJECT | The committee refused. The verdict names the rule that fired — sanctions hit, policy exclusion, hard methodology breach. | Send the rejection notice. The reason is regulator-citeable.
REVIEW | The committee couldn’t decide. Signals are mixed or compliance asked for human eyes (e.g. PEP exposure). | Open the case, read the dossiers, and decide. Use the Approve / Reject buttons in the four-eyes panel.
REWORK | The case is fundamentally fine — compliance OK, policy OK, risk within methodology — but the application package is incomplete: missing forms, missing signatures, wrong template. Not a credit decline. | The verdict carries a checklist of missing items. Send the list back to the branch or the customer to complete the package, then resubmit.
Why REWORK is its own outcome: Before v4 the only options were APPROVE, REJECT, or REVIEW. A case with missing paperwork would land as REVIEW, and a human reviewer would have to figure out that the problem wasn’t a credit question — it was a forms question. REWORK separates the “deal is alive but documents are missing” case so the operator workflow is obvious: chase the paperwork, don’t reopen the credit decision.

When the committee can’t run at all

Sometimes the committee never delivers a verdict — the LLM provider is down, an API key is invalid, the network drops mid-debate. In that case the verdict card is replaced by a clear “Committee unavailable” error block and the session is marked FAILED, not COMPLETED. The four-eyes review form is suppressed (there is no automated decision to sign off on) and the case can be re-run via the Re-run debate button once the underlying issue is fixed. No work is lost — the application package, the corpus reads, and any partial dossiers stay attached to the original session for audit.

6 · When the system asks you a question

Sometimes an agent isn’t confident enough to commit. When confidence drops below the threshold (default 0.6) and the agent has a single specific question that an operator could realistically answer in a sentence, the agent pauses and asks. This is by design — you want the system to flag genuine uncertainty rather than fabricate confidence.

What this looks like

The dossier card on the analyst page replaces its body with a question banner: “✋ Agent paused for clarification — round 1 of 2.” Below that, the agent’s question (often multi-part, e.g. “(1) How long has the business been registered? (2) What’s the appraised collateral value? (3) Were the missed payments recent or historical?”) and a small text-area for your answer.

What you can do

Each agent gets up to two clarification rounds per case. If the agent is still uncertain after round 2, the loop caps and the case proceeds to the conflict tree with whatever best-effort dossier the agent produced — usually a REVIEW recommendation, which routes to a human anyway.
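
As a rough sketch of how that gate could work, the snippet below pauses only when confidence is below the threshold, a focused question exists, and rounds remain. The type and field names are illustrative, not the production contract.

```typescript
interface DraftDossier {
  confidence: number;          // 0..1, the agent's self-reported confidence
  clarifyingQuestion?: string; // set only when the agent has one focused question
}

const CONFIDENCE_THRESHOLD = 0.6; // default; editable under Settings
const MAX_ROUNDS = 2;             // per agent, per case

type NextStep =
  | { kind: "proceed" }                         // commit the best-effort dossier
  | { kind: "ask-operator"; question: string }; // pause and surface the question

function nextStep(dossier: DraftDossier, roundsUsed: number): NextStep {
  const uncertain = dossier.confidence < CONFIDENCE_THRESHOLD;
  const canStillAsk = roundsUsed < MAX_ROUNDS;
  if (uncertain && canStillAsk && dossier.clarifyingQuestion) {
    return { kind: "ask-operator", question: dossier.clarifyingQuestion };
  }
  // After round 2 (or with no focused question) the dossier proceeds to the
  // conflict tree as-is, usually ending in a REVIEW recommendation.
  return { kind: "proceed" };
}
```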

Cross-session memory

A Q&A you provide on one case becomes standing guidance for similar future cases. The next time an agent is about to ask the same question, it gets your earlier answer injected into its prompt as institutional memory. You can see this on the analyst page as a “📚 Prior operator guidance” row above the agent’s argument — expand it to read what context the agent was given.

Treat clarifications as feature, not friction: an agent asking a focused question is the system telling you “the corpus doesn’t cover this case.” That’s a signal to either (a) answer once and move on, or (b) have the Architect draft a policy update so future cases of this shape don’t need asking. Marvin (the agent) tracks repeated clarifications across sessions and surfaces them as “knowledge gap” findings in the System Health tab.

7 · How decisions are made — the conflict tree

When the three live agents disagree — Compliance says pass, Quant says breach-hard, Underwriter says approve — somebody has to decide who wins. In a real bank committee, that’s the chair’s job, and the chair follows a written hierarchy. We wrote the hierarchy in TypeScript. It’s a function, it’s reproducible, and it has unit tests. The LLM only writes the prose that explains which branch fired.

The five levels

Level | Owner | Question it answers
L1 · Absolute Veto | Compliance Guardian | Is the deal legal? Sanctions, AML, statutory limits.
L2 · Corporate Veto | Underwriter | Does the deal fit bank policy / risk appetite?
L3 · Quantitative Risk | Data Quant | Do the numbers fit the methodology?
L4 · Operational & Process | Underwriter | Is the application package complete?
L5 · Heuristic / STP | Orchestrator | Anything else? Straight-through approval or fallthrough review.

Higher levels never give way to lower ones. A Compliance hard-reject ends the conversation regardless of how attractive the numbers look. A bank-policy decline can only be overturned by an explicit regulatory_exception factor (which routes the case to senior management, not to auto-approval).

The eleven branches

The function picks one of these branches and returns its name. The branch name is the audit trail.

L | Branch | Decision | What it means
L1 | compliance-hard-reject | REJECT | Sanctions / AML / forbidden sector. Hard stop. No override.
L1 | compliance-review-required | REVIEW | PEP exposure or weak fuzzy match. Human compliance officer signs.
L2 | policy-decline | REJECT | Underwriter rejects on bank-policy grounds; numbers are fine, the block is policy-driven.
L2 | policy-escalation | REVIEW | Same as policy-decline but Underwriter cited a regulatory_exception — routes to senior sign-off.
L3 | quant-breach-hard-no-comp | REJECT | Hard methodology breach with no strong compensating factor.
L3 | quant-breach-hard-with-comp | REVIEW | Hard breach but Underwriter cited a strong factor — non-trivial, defer to a human.
L3 | quant-breach-soft-with-comp | APPROVE | Soft breach + strong factor — approve with conditions.
L3 | quant-breach-soft-no-comp | REVIEW | Borderline numbers, no clear save — defer.
L4 | documents-rework | REWORK | Case otherwise clean, package incomplete — return for paperwork.
L5 | all-clear | APPROVE | Everything passes — straight-through approval.
L5 | fallthrough-review | REVIEW | Signals disagree in a way the explicit branches don’t cover — defer.

Why deterministic code, not an LLM?

Auditors will ask: “why did the system approve this loan?” The answer must not depend on which model version was running on the day. By implementing the tree in TypeScript, we get reproducibility (same inputs always produce the same branch), testability (every branch has a unit test), and a function a regulator can read line by line before any LLM is involved. The LLM still runs — the Orchestrator agent — but only to write the prose explanation of the branch the tree picked.
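
For a feel of what such a function looks like, here is a compressed TypeScript sketch that follows the five levels and the branch names from the table above. It is not the production tree: it treats any compensating factor as “strong”, collapses the dossier shapes to the minimum fields needed, and skips precedent and condition handling.

```typescript
type Branch =
  | "compliance-hard-reject" | "compliance-review-required"
  | "policy-decline" | "policy-escalation"
  | "quant-breach-hard-no-comp" | "quant-breach-hard-with-comp"
  | "quant-breach-soft-with-comp" | "quant-breach-soft-no-comp"
  | "documents-rework" | "all-clear" | "fallthrough-review";

type Decision = "APPROVE" | "REJECT" | "REVIEW" | "REWORK";

interface Dossiers {
  compliance: { verdict: "pass" | "review-required" | "hard-reject" };
  underwriter: {
    recommendation: "APPROVE" | "REJECT" | "REVIEW";
    compensatingFactors: string[]; // e.g. "low_dti", "regulatory_exception"
    packageComplete: boolean;
  };
  quant: { verdict: "within-policy" | "breach-soft" | "breach-hard" };
}

function decide(d: Dossiers): { branch: Branch; decision: Decision } {
  // L1: absolute veto
  if (d.compliance.verdict === "hard-reject")
    return { branch: "compliance-hard-reject", decision: "REJECT" };
  if (d.compliance.verdict === "review-required")
    return { branch: "compliance-review-required", decision: "REVIEW" };

  // L2: corporate veto
  if (d.underwriter.recommendation === "REJECT")
    return d.underwriter.compensatingFactors.includes("regulatory_exception")
      ? { branch: "policy-escalation", decision: "REVIEW" }
      : { branch: "policy-decline", decision: "REJECT" };

  // L3: quantitative risk
  const hasComp = d.underwriter.compensatingFactors.length > 0;
  if (d.quant.verdict === "breach-hard")
    return hasComp
      ? { branch: "quant-breach-hard-with-comp", decision: "REVIEW" }
      : { branch: "quant-breach-hard-no-comp", decision: "REJECT" };
  if (d.quant.verdict === "breach-soft")
    return hasComp
      ? { branch: "quant-breach-soft-with-comp", decision: "APPROVE" }
      : { branch: "quant-breach-soft-no-comp", decision: "REVIEW" };

  // L4: operational & process
  if (!d.underwriter.packageComplete)
    return { branch: "documents-rework", decision: "REWORK" };

  // L5: straight-through or fallthrough
  return d.underwriter.recommendation === "APPROVE"
    ? { branch: "all-clear", decision: "APPROVE" }
    : { branch: "fallthrough-review", decision: "REVIEW" };
}
```

The point of the shape is that every input combination lands on exactly one named branch, and it is the branch name, not the prose, that the audit trail records.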

8 · The corpus — how agents read your documents

The agents’ opinions are only as good as the documents they read. The corpus is the bank’s knowledge layer: every internal policy, every regulatory text, every annex, every precedent log. The Corpus tab in the admin console is where you upload, review, and remove bundles.

Uploading a document

Open Admin console → Corpus. The upload form has four sections, each answering a question about the document:

  1. File

    The PDF itself. Drag in or browse.

  2. Strength — conflict tier

    How authoritative the content is. Five tiers, T1 highest. Drives retrieval ranking.

  3. Routing — target agent

    Which agent gets to read this. Under Marvin’s least-privilege RAG, a routed document is invisible to every other agent. Global fans the bundle out to all eight agents (rare; reserved for shared glossaries).

  4. Meta

    Source ID (the citation label, e.g. EBA-GL-2020-06), jurisdiction (EU, BG, Internal/Bank, etc.), volatility (static / evolving / dynamic) and version. All four propagate into every chunk produced from the document.
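
As a sketch of how that metadata could travel with the document, the types below mirror the four form sections; the type and field names are illustrative rather than the actual schema, and only the fields themselves (source ID, jurisdiction, tier, routing, volatility, version) come from the form above.

```typescript
type Tier = 1 | 2 | 3 | 4 | 5;               // T1 absolute veto … T5 heuristics
type Volatility = "static" | "evolving" | "dynamic";
type TargetAgent =
  | "compliance" | "underwriter" | "quant" | "orchestrator"
  | "archivist" | "scout" | "architect" | "marvin" | "global";

interface BundleMeta {
  sourceId: string;      // citation label, e.g. "EBA-GL-2020-06"
  jurisdiction: string;  // "EU", "BG", "Internal/Bank", …
  tier: Tier;            // conflict tier; drives retrieval ranking
  routing: TargetAgent;  // which agent's allowlist receives the bundle
  volatility: Volatility;
  version: string;
}

// Every chunk produced from the PDF inherits the bundle's metadata, so a
// citation chip can always be traced back to tier, source, and version.
interface KnowledgeChunk extends BundleMeta {
  chunkId: string;
  breadcrumb: string;    // section-header path shown in Preview
  text: string;
  tokenCount: number;
}
```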

The five tiers

Tier | Name | Examples | Conflict-tree role
T1 | Absolute Veto | EBA guidelines, BNB ordinances, statute law | Hard stop. No override. Authored only by the regulator.
T2 | Corporate Veto | Internal credit policy, risk appetite, sector exclusions | Vetoes anything below it. Authored by the bank.
T3 | Quantitative Risk | DSTI/DSCR formulas, scoring model, threshold tables | Numeric inputs. Override-able only by T1/T2 with documented exception.
T4 | Operational & Execution | Application form layout, document checklist | How-to material. Conflicts here are clerical, not policy.
T5 | Heuristics & Precedents | Precedents log, training material, analyst rules-of-thumb | Lowest authority. Useful as soft signal; never decisive on its own.

When the embedder is unreachable. RAG is an enhancement, not a hard dependency. If the embedding service times out (Voyage / OpenAI flake, network blip, bad outbound DNS), retrieval returns an empty passage set with an explicit embedder_unavailable gap. The agent runs without grounding, declares known: false in its dossier, and the conflict tree typically lands the case in REVIEW with the missing-grounding signal in the audit trail. The committee still produces a real verdict; it just stays cautious. No silent degradation, no fallback-engine placeholder.

Retriever weights descend with tier (1.00 → 0.92 → 0.80 → 0.70 → 0.60). A T1 chunk and a T5 chunk with identical similarity rank T1 first; a T5 chunk has to clear a meaningfully higher bar to compete.
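
A minimal sketch of that ranking rule, assuming tier weight simply multiplies the similarity score (the exact combination rule in the retriever may differ):

```typescript
// Tier weights as listed above; lower tiers need higher raw similarity to win.
const TIER_WEIGHT: Record<number, number> = { 1: 1.0, 2: 0.92, 3: 0.8, 4: 0.7, 5: 0.6 };

interface ScoredChunk { chunkId: string; tier: number; similarity: number }

function rank(chunks: ScoredChunk[]): ScoredChunk[] {
  return [...chunks].sort(
    (a, b) => b.similarity * TIER_WEIGHT[b.tier] - a.similarity * TIER_WEIGHT[a.tier],
  );
}

// Example: a T5 chunk at similarity 0.80 scores 0.48, while a T1 chunk at 0.55
// scores 0.55, so the regulator text still ranks first.
```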

Previewing what’s actually inside a bundle

Every bundle row carries a Preview button next to Delete. Clicking it expands a strip showing the first few chunks of text the agents will see — the breadcrumb / section header for each chunk, a 600-character excerpt, the tier badge, and the token count. The goal is fast verification: you can scroll the corpus list and confirm “yes, the AlfaBank consumer-loan policy actually has a chapter on DSTI in here” without trusting the chunk count alone or opening the source PDF in a separate tab. Preview is read-only; chunk text comes straight from knowledge_chunks.

Per-agent allowlists

Each agent reads only what its allowlist explicitly grants. The admin’s Corpus tab shows you, on every bundle row, which agents currently have access via the “Read by” column — small chips with each agent’s avatar.

Agent | Reads
Archivist | Everything (privileged)
Underwriter | Credit policy, risk-appetite list, application-package requirements, self-employed annexes (income verification + compensating factors)
Data Quant | Scoring methodology, refinancing/consolidation rules (DTI sections), self-employed annex (DSCR formula)
Compliance Guardian | AML/KYC instruction only
Orchestrator | Precedents log, refinancing strategic-bonus sections, self-employed annex compensating-factor sections
Scout | None internal — web only
Architect | Old versions of internal docs (so it knows what to revise) plus regulatory reading material
Marvin | None — reads system logs, never policies

Why the allowlist matters: A model that has read everything is one that can hallucinate from anywhere. Restricting Compliance to AML/KYC means a misbehaving Compliance prompt physically can’t cite the Underwriter’s policy clauses to soften a sanctions hit — the chunks aren’t in its retrieval pool. Least-privilege isn’t just security; it’s a hallucination control.
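
A minimal sketch of that control, assuming the allowlist is a simple agent-to-bundle map applied before any similarity search; the names are illustrative.

```typescript
interface Chunk { chunkId: string; bundleId: string; text: string }

// Least-privilege retrieval: an agent's query only ever touches chunks from
// bundles on its allowlist, so off-limits clauses can never be surfaced or cited.
function retrievableFor(
  agent: string,
  allowlist: Map<string, Set<string>>, // agent → ids of bundles it may read
  corpus: Chunk[],
): Chunk[] {
  const allowed = allowlist.get(agent) ?? new Set<string>();
  return corpus.filter((c) => allowed.has(c.bundleId));
}
```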

Citation chips

Every claim an agent makes carries a chunkId citation. On the analyst page, these render as small purple chips next to each argument. Click one to open the source passage in a side dialog, with full text, breadcrumb, jurisdiction, and tier. If a citation chip ever fails to resolve (the chunk vanished from the corpus), Marvin’s next overnight run flags it as a hallucination finding.
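
The check behind that hallucination flag can be sketched as follows, assuming a lookup helper against knowledge_chunks; the helper and field names are illustrative.

```typescript
interface Citation { chunkId: string; sessionId: string; agent: string }

// Return every citation whose chunkId no longer resolves to a real chunk;
// each one becomes a "hallucination" finding in System Health.
async function findDanglingCitations(
  citations: Citation[],
  chunkExists: (chunkId: string) => Promise<boolean>, // e.g. a lookup in knowledge_chunks
): Promise<Citation[]> {
  const dangling: Citation[] = [];
  for (const c of citations) {
    if (!(await chunkExists(c.chunkId))) dangling.push(c);
  }
  return dangling;
}
```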

9 · How the policy library grows — Strategy loop

A regulator changes a guideline; a new sector appears on a watchlist; an academic paper shifts the consensus on a methodology. Most credit-AI products freeze at deployment and drift quietly. Marvin has Department B watching the open world so the corpus stays current.

Newsroom — the Scout’s feed

Open Admin console → Newsroom. The Scout polls regulatory feeds (BNB, EBA, Държавен вестник (the Bulgarian State Gazette), ECB, BIS) on a schedule and runs targeted web searches for textbooks and working papers. Each finding lands here with a relevance score, category pill, and source link. You’ll see four states:

Click a finding to open the full body. From here you can Draft policy — which sends the finding to the Architect — or Dismiss if it’s noise. Click Run Scout now to pull a fresh batch on demand.

When the Architect has already produced a draft from a finding, the row shows a small → View draft in Policy Lab link with the draft’s current status (proposed / approved / rejected). Clicking it jumps to Policy Lab, expands the matching draft card, and tints it briefly so you can see what you were brought to. The two tabs stay separate — different agents, different actions — but the loop between them is now legible.

Policy Lab — the Architect’s drafts

Open Admin console → Policy Lab. The Architect produces drafts in response to Scout findings (or to gaps the Orchestrator flagged). Each draft is an expandable card with title, target tier, proposed bundle label, and a Markdown body you can edit inline. Three actions on each draft:

Each draft card carries a ← From news: “…” back-link to the originating Newsroom row, joined from policy_drafts.source_news_id. Click it to jump back to the Scout’s finding so you can see what triggered the draft before you sign off on it. Drafts produced from an internal knowledge gap (rather than external news) read “orchestrator gap” instead.

The Architect can’t author Tier 1: Tier 1 is reserved for the regulator’s own text. The Architect’s allowed range is T2 (corporate veto) through T5 (heuristics) — both the prompt and the database CHECK constraint enforce this. If you want to ingest a regulator document, you upload it directly via the Corpus tab as T1.

End to end — click Run Scout now, three findings appear; click Draft policy on one, a new draft lands in Policy Lab; edit the Markdown, click Approve & ingest, the new bundle appears in Corpus. Total wall-clock: 30–60 seconds.

10 · Marvin’s god view — how the system catches its own drift

The eighth agent is the one the product is named after. Marvin watches the other seven — their conversations, their dossiers, their citations, their clarification logs, their evals — and surfaces problems before they touch live applicants. The System Health tab is his dashboard.

Five signal categories

Category | What Marvin watches for
Bottleneck | An agent’s p95 latency drifting above the cross-agent median. Usually means a retrieval timeout or a misconfigured allowlist.
Hallucination | Citation chunkIds that don’t resolve to a real chunk in knowledge_chunks. The agent invented the source.
Knowledge gap | The same clarification question being asked across multiple unrelated cases. Either the corpus is missing a policy or the prompt isn’t pointing the agent at the right section.
Conflict loop | The Orchestrator’s REVIEW rate climbing over a window. Possible causes: a retrieval threshold drifted, a prompt tweak softened an agent’s commitment, methodology changed.
Prompt drift | An eval accuracy drop tied to a specific prompt-set hash. The last edit you made to a system prompt regressed something measurable.
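
As an illustration of the first category, here is a rough sketch of a bottleneck check over per-agent latencies; the 1.5× margin and the data shape are assumptions, not the tuned production heuristic.

```typescript
function percentile(sortedAsc: number[], p: number): number {
  const idx = Math.min(sortedAsc.length - 1, Math.floor(p * (sortedAsc.length - 1)));
  return sortedAsc[idx];
}

function bottleneckFindings(
  latenciesByAgent: Record<string, number[]>, // per-run latency in ms, e.g. from agent_outputs
  margin = 1.5,                               // assumption: how far above median counts as drift
): string[] {
  const p95: Record<string, number> = {};
  for (const [agent, xs] of Object.entries(latenciesByAgent)) {
    p95[agent] = percentile([...xs].sort((a, b) => a - b), 0.95);
  }
  const values = Object.values(p95).sort((a, b) => a - b);
  const median = values[Math.floor(values.length / 2)];
  return Object.entries(p95)
    .filter(([, v]) => v > median * margin)
    .map(([agent, v]) => `bottleneck: ${agent} p95 ${v}ms vs cross-agent median ${median}ms`);
}
```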

What you do with a finding

Each finding shows in System Health with severity (block / warn / info), title, Markdown recommendation, and the evidence Marvin used (specific session ids, log excerpts). Three actions per finding:

Click Run Marvin now to fire an immediate analysis pass. Otherwise Marvin runs nightly on a schedule.

Marvin recommends — never deploys: In v4 Marvin produces findings only. He doesn’t edit prompts, change allowlists, or spin up new agents. The operator approves every change. Autonomous remediation is on the roadmap (v4.5), but it’s deliberately gated until the recommendation flow has earned enough trust to extend.

11 · Evals — testing on historical data

Before you trust the system on a live portfolio, you want to know how it would have performed on cases where the outcome is already known. The Evals tab is for that.

The flow

  1. Upload a dataset

    CSV with one row per historical loan: the application fields plus the realised outcome (defaulted / repaid). The system stores it and shows row counts.

  2. Start a run

    Click Run eval on a dataset. The system processes rows in order through the same pipeline a live case uses (Compliance → Underwriter ‖ Quant → Orchestrator). Cursor-based, so re-running picks up where it left off without duplicates.

  3. Read the results

    Confusion matrix (predicted vs realised), accuracy, precision/recall, expected loss per €1k lent. Per-row replay link — click any row to open the full analyst view of that historical case as the system would have decided it today.
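
As a sketch of how those headline numbers could be derived, the snippet below treats REJECT as “predicted default” and APPROVE as “predicted repaid”; that mapping, the row shape, and the loss calculation are simplifying assumptions (REVIEW and REWORK rows would need their own treatment in the real report).

```typescript
interface EvalRow {
  decision: "APPROVE" | "REJECT"; // the committee's verdict on the historical case
  defaulted: boolean;             // realised outcome from the dataset
  principal: number;              // € lent when the case was approved
  lossGivenDefault: number;       // € actually lost when the loan defaulted
}

function evalMetrics(rows: EvalRow[]) {
  let tp = 0, fp = 0, tn = 0, fn = 0, loss = 0, lent = 0;
  for (const r of rows) {
    const predictedDefault = r.decision === "REJECT";
    if (predictedDefault && r.defaulted) tp++;
    else if (predictedDefault && !r.defaulted) fp++;
    else if (!predictedDefault && !r.defaulted) tn++;
    else fn++;
    if (r.decision === "APPROVE") {
      lent += r.principal;
      if (r.defaulted) loss += r.lossGivenDefault;
    }
  }
  return {
    accuracy: (tp + tn) / rows.length,
    precision: tp / Math.max(tp + fp, 1), // of the cases it rejected, how many truly defaulted
    recall: tp / Math.max(tp + fn, 1),    // of the true defaults, how many it rejected
    expectedLossPer1kLent: lent > 0 ? (loss / lent) * 1000 : 0,
  };
}
```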

Why this matters for trust

An eval is the closest you get to a back-test. It’s also the input that lets Marvin detect prompt drift — if you edit a system prompt and the same dataset’s accuracy drops, the next overnight run fires a prompt-drift finding linking the eval regression to the prompt-set hash that produced it.

Tip · keep one dataset frozen: Designate one historical dataset as your “golden” benchmark and never edit it. Run evals against it after every prompt or settings change. Drift on the golden set is a hard signal that something regressed.

12 · Four-eyes review

Every case opens with a maker (the operator who submitted) and a slot for a checker (a second operator who must approve). This is the standard regulatory-bank control: no single person can move a case from REVIEW to APPROVE/REJECT.

What you see

The analyst page shows a Four-eyes Review block under the dossier grid with four states:

The maker can never check their own case — the button is disabled with an explanation. If you see the wrong checker name on a case, log out and log back in as the correct user; the maker/checker identity comes from the X-User-Id header which the admin login flow sets.
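
A minimal sketch of that rule, assuming the checker identity is read from the X-User-Id header mentioned above; the record shape and function name are illustrative.

```typescript
interface CaseRecord {
  sessionId: string;
  makerId: string;                  // operator who submitted the case
  checkerId?: string;               // second operator who signs off
  finalCall?: "APPROVE" | "REJECT";
}

function recordCheckerDecision(
  record: CaseRecord,
  headers: Record<string, string | undefined>,
  decision: "APPROVE" | "REJECT",
): CaseRecord {
  const checkerId = headers["x-user-id"]; // identity set by the admin login flow
  if (!checkerId) throw new Error("not signed in: no X-User-Id on the request");
  if (checkerId === record.makerId) {
    // Mirrors the disabled button in the UI: no single person moves a case
    // from REVIEW to APPROVE or REJECT.
    throw new Error("maker cannot check their own case");
  }
  return { ...record, checkerId, finalCall: decision };
}
```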

13 · Tuning the system

The Settings tab in the admin console is where operators tune the system without touching code or redeploying. Five categories of knob:

Caching note: runtime settings are cached in-process for 60 seconds. The machine that takes the write invalidates immediately, but on a multi-machine deploy any other backend can serve a stale value for up to 60s after a save. The current single-machine Fly setup makes this invisible; revisit the TTL or move to LISTEN/NOTIFY / Redis when adding a second machine.
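
A rough sketch of that cache behaviour, with loader and writer functions standing in for the real database access (names are illustrative):

```typescript
const TTL_MS = 60_000; // settings are served from memory for up to 60 seconds

let cache: { value: Record<string, string>; loadedAt: number } | null = null;

async function getSettings(load: () => Promise<Record<string, string>>) {
  if (cache && Date.now() - cache.loadedAt < TTL_MS) return cache.value;
  cache = { value: await load(), loadedAt: Date.now() };
  return cache.value;
}

async function saveSetting(
  key: string,
  value: string,
  persist: (k: string, v: string) => Promise<void>,
) {
  await persist(key, value);
  cache = null; // immediate invalidation, but only on the machine that took the write
}
```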

Retrieval

Clarification flow

System prompts

Each of the eight agents has an editable system prompt. Click any agent’s row to expand the textarea, edit, and save. The change is visible to the next debate. The version manifest records the prompt-set hash on every decision, so an auditor can tell you exactly which prompt version produced a given verdict.
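
One way such a prompt-set hash could be derived is sketched below, assuming a Node runtime and a stable ordering of the eight prompts; the function and field names are illustrative.

```typescript
import { createHash } from "node:crypto";

// Hash all system prompts in a deterministic order so the manifest can pin
// exactly which prompt set produced a given verdict.
function promptSetHash(promptsByAgent: Record<string, string>): string {
  const canonical = Object.keys(promptsByAgent)
    .sort()
    .map((agent) => `${agent}\n${promptsByAgent[agent]}`)
    .join("\n---\n");
  return createHash("sha256").update(canonical).digest("hex").slice(0, 12);
}
```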

LLM provider & model

Switch between Anthropic Claude variants (Haiku / Sonnet / Opus) or OpenAI models without a redeploy. Each model has its own pricing tier shown in the dropdown.

Live pipeline

The LIVE_PIPELINE flag toggles between the v3 factor debate (legacy) and the v4 committee. New deployments default to v4; v3 stays available for backwards compatibility with sessions that already started under it.

Operator caution: Edits to system prompts and retrieval thresholds change agent behaviour immediately. After any change, run your golden eval (§11) to confirm nothing regressed. Marvin will eventually flag a regression on its own — but a manual eval is faster and removes the worry.

14 · Quick reference

Glossary

Term | Meaning
Agent | One of the eight specialised AI workers (Compliance, Underwriter, Quant, Orchestrator, Archivist, Scout, Architect, Marvin).
Department | A grouping of agents by purpose: A (live), B (strategy), C (evolution).
Dossier | An agent’s structured output for a single case — verdict, reasoning, citations, confidence.
Conflict tree | The deterministic TypeScript function that picks the decision branch from the three Department-A dossiers.
Branch | One of the eleven leaves of the conflict tree. The branch name is the audit trail.
Tier | The authority level of a corpus document. T1 (regulator) through T5 (precedents). Drives retrieval ranking.
Allowlist | Per-agent list of bundles the agent is permitted to read. Implements least-privilege RAG.
Bundle | A single ingested document (or document section) in the corpus. The unit of retrieval and allowlisting.
Clarification | A pause-and-ask round where an agent surfaces a focused question to the operator.
Manifest | The version pin saved per decision — prompt hash, retrieval settings, embedder, corpus bundles. Auditors love these.
Maker / Checker | Two-person rule. Maker submits; checker (must be a different operator) approves or rejects the final call.
Finding | One of Marvin’s observations in the System Health tab. Severity-sorted; recommends only.

Where to find things

The admin console is grouped into four buckets so the eight tools don’t feel equally urgent: Operate (live work), Knowledge (the corpus and what feeds it), Quality (testing and observability), and Settings. The path column below names the group then the leaf.

I want to… | Go to
Submit a case | Cases (left sidebar) → pick a preset row or hit the lime + Submit new case button (top-right of the list)
Watch a debate | The analyst page — opens automatically after submit
Review live debates | Admin console → Operate → Debates
Resolve a clarification request | Admin console → Operate → Clarifications
Upload a policy document | Admin console → Knowledge → Corpus → Upload
See what Scout found | Admin console → Knowledge → Newsroom
Approve a draft policy | Admin console → Knowledge → Policy Lab
Run an eval | Admin console → Quality → Evals
Check on system health | Admin console → Quality → System Health
Edit an agent’s prompt | Admin console → Settings → expand the agent’s row
Tune retrieval / clarification thresholds | Admin console → Settings

15 · What changed from v3

v3 was a single-debate model: an Approver agent and a Rejector agent argued over a fixed list of factors, a Judge agent picked sides, and the operator read the transcript. It worked, but it was hard to map onto a real bank’s controls and the audit trail was hard to read.

v4 is the committee model documented in this guide. Briefly:

v3 | v4
3 generic agents (Approver / Rejector / Judge) | 8 specialised agents across 3 departments — including Marvin watching the watchers
5-factor debate; one transcript per case | Per-agent dossiers; conflict tree picks the verdict deterministically
3 outcomes (APPROVE / REJECT / REVIEW) | 4 outcomes — REWORK added for missing-paperwork cases that aren’t credit declines
Single corpus pool, shared by all agents | Per-agent allowlists; least-privilege RAG; 5-tier authority scheme
Manual policy updates only | Strategy loop — Scout pulls news, Architect drafts, operator approves & ingests
No system-level monitoring | Marvin watches drift, hallucinations, bottlenecks, prompt regressions
Clarifications were one-shot, factor-keyed | Clarifications are agent-keyed, multi-round, attachment-aware, and feed cross-session memory

v3 sessions remain queryable on the analyst page — the schema_version column on debate_sessions drives the layout, so historical v3 cases keep their original visual treatment when you click Replay.

If you’re evaluating both versions side by side: pick the same input case from the inbox, run it once with LIVE_PIPELINE=v3 and once with LIVE_PIPELINE=v4, then compare the analyst pages. The v4 page will tell you which agent saw what, why, and on the basis of which citation. The v3 page will show you a debate transcript. Both reach a verdict; v4’s is regulator-citeable end to end.