Council

A structured deliberation protocol across diverse AI models

Single-model outputs are fluent, fast, and often wrong in the same direction. Council runs a bounded deliberation across different models, processes critique explicitly, and returns a decision you can inspect. The point is not more text. It is better judgment.


The problem

The illusion of consensus

A single answer can sound settled long before the underlying question is resolved.

Most AI interfaces compress generation, critique, and conclusion into one response. That response may hide uncertainty, skip alternatives, or converge too early on a plausible story. Fluency makes that failure mode easy to miss.

Different AI models have different biases and blind spots. Council puts those differences to work.

Instead of asking one model to simulate a debate with itself, Council collects distinct proposals, forces explicit critique, and makes disagreement visible. You see where the options diverge, what objections survive scrutiny, and what still needs evidence.

The goal is not more output. It is better judgment.


How it works

A bounded deliberation protocol

Council runs a structured, three-stage mechanism.

1. Proposals. Multiple models respond to the same task independently. Each produces a clear position, not a stream of partial thoughts.

2. Critique. Models review each other through a constrained vocabulary: challenge, alternative, refinement, and question.

3. Resolution. A final pass classifies the disagreement and produces one of four adaptive outcomes: recommendation, alternatives, question, or investigate.

The system does not force false agreement. When models agree for good reasons, you get a recommendation. When they disagree on real tradeoffs, you get alternatives. If the prompt is underspecified, you get a blocking question. If the answer depends on missing facts, you get an explicit investigation path.
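The three-stage flow and four outcomes described above can be sketched as a small orchestration loop. This is illustrative only: the function names, the `Outcome` enum, and the `ask(model, prompt)` provider interface are assumptions for the sketch, not Council's actual API.

```python
from dataclasses import dataclass
from enum import Enum

class Outcome(Enum):
    RECOMMENDATION = "recommendation"  # models agree for good reasons
    ALTERNATIVES = "alternatives"      # real tradeoffs, no forced consensus
    QUESTION = "question"              # the prompt is underspecified
    INVESTIGATE = "investigate"        # the answer depends on missing facts

@dataclass
class Proposal:
    model: str
    position: str

# The constrained critique vocabulary from stage 2.
CRITIQUE_KINDS = ("challenge", "alternative", "refinement", "question")

def deliberate(models, task, ask):
    """Run one bounded deliberation. `ask(model, prompt) -> str` stands in
    for any provider call (Gemini, ChatGPT, Claude, Ollama, ...)."""
    # Stage 1: independent proposals, same task, no cross-talk.
    proposals = [Proposal(m, ask(m, task)) for m in models]

    # Stage 2: every model critiques every other model's position.
    critiques = []
    for reviewer in models:
        for p in proposals:
            if p.model == reviewer:
                continue
            prompt = f"Critique this position using one of {CRITIQUE_KINDS}: {p.position}"
            critiques.append((reviewer, p.model, ask(reviewer, prompt)))

    # Stage 3: classify the disagreement instead of papering over it.
    return proposals, critiques, classify(proposals)

def classify(proposals):
    # Placeholder heuristic: a real resolver would also detect
    # underspecified prompts (QUESTION) and missing facts (INVESTIGATE).
    positions = {p.position for p in proposals}
    return Outcome.RECOMMENDATION if len(positions) == 1 else Outcome.ALTERNATIVES
```

The key structural point is that critique is a separate pass over already-committed positions, not a single model extending its own chain of reasoning.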

Real deliberation: 2 models · 3 stages · 1 follow-up question

"We're a 4-person team running a customer-facing API. After 3 deploy incidents this month, the VP of Engineering wants us to add a staging environment. I think that'll slow us down and the real problem is our lack of observability. We ship 5-10 times a week and I don't want to lose that velocity. How should I make this case, or am I wrong?"

chatgpt proposal

Fix the failure modes first — add staging only if the incidents are integration gaps that prod-safe testing can't catch

Your incident pattern matters more than the generic best practice. A staging environment often catches less than leaders expect because it rarely matches real production traffic, data shape, or integrations. The politically smart argument is not “staging is dumb” — it is: fix the incident classes you actually had, with the cheapest controls that would have prevented them.

gemini proposal

Reject staging — implement feature flags and automated rollbacks driven by observability

Staging environments inevitably drift from production, creating a false sense of security while bottlenecking a 4-person team into a rigid deployment queue. Progressive delivery solves the VP's underlying goal — eliminating customer-facing incidents — while preserving your 5–10x weekly velocity.

chatgpt challenge → gemini

Gemini is overclaiming on feature flags. Flags do not prevent customer impact for schema changes, config errors, dependency/runtime issues, bad migrations, or deploy-time infrastructure mistakes — the exact classes staging can catch. The critical question is not “staging vs flags” but whether the 3 incidents were release-control failures or environment/integration failures.

synthesis question

Were your recent deploy incidents caused by structural issues (config drift, missing env vars, schema mismatches) or emergent issues (production data edge cases, unexpected traffic, stateful third-party behaviors)? A staging environment primarily catches structural errors. Observability and progressive delivery primarily mitigate emergent errors.

you

Not sure, but I believe they were structural.

synthesis investigate

Don't push back on the VP yet — verify your hunch first

If the incidents were truly structural, your VP is right: feature flags and observability do not prevent structural deploy failures. But you're unsure. Pull the post-mortems, categorize each failure, and ask: would staging have realistically caught this without production-scale traffic?

Provisional: Go to the VP: “Give me 48 hours to map the exact failure modes of the last three incidents so we invest in the right safety net without killing our velocity.”
If structural: Build ephemeral preview environments per PR: automated pre-prod verification without the permanent staging tax.
If emergent: Show the VP the data proving staging wouldn't have caught the bugs, and pitch progressive delivery: feature flags, canaries, and automated fast-rollbacks.

Real output from a Council session. The council identified a missing fact, asked a clarifying question, and re-resolved with the answer.


Research

The case for heterogeneity

Prompt variation helps. Persona variation can help. But they are not the same as model diversity.

Research on multi-agent debate shows that performance improves when systems compare competing answers instead of extending a single chain of reasoning.1 More recent work suggests the gains are strongest when the agents are actually different. Same-model personas tend to converge on the same errors and hit diminishing returns.2,3

If all participants share the same base model, the interaction adds style diversity without adding epistemic diversity. Heterogeneous setups are better at surfacing conflicting assumptions and distinct failure modes.3,4 Furthermore, persona prompting alone does not reliably improve factual performance.5

Council is built on that premise: use genuinely different models, structure their interaction, and make the resolution explicit.


Architecture

Built for auditability and local control

High-stakes use requires an inspectable system.

Event Bus

Orchestration is decoupled from rendering. Deliberation events are emitted once and rendered wherever needed: terminal, board, logs, or downstream tooling.
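The emit-once, render-anywhere pattern described above boils down to a small publish/subscribe core. The class and event names here are hypothetical; Council's real event schema may differ.

```python
from collections import defaultdict

class EventBus:
    """Orchestration emits each deliberation event once;
    any number of renderers (terminal, board, logs) subscribe."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def emit(self, event_type, payload):
        for handler in self._subscribers[event_type]:
            handler(payload)

# Two independent renderers reacting to the same event.
log_lines = []
bus = EventBus()
bus.subscribe("proposal", lambda p: log_lines.append(f"[log] {p['model']}"))
bus.subscribe("proposal", lambda p: print(f"[terminal] {p['model']}: {p['position']}"))

bus.emit("proposal", {"model": "gemini", "position": "Reject staging"})
```

Because the orchestrator never calls a renderer directly, downstream tooling can be added or removed without touching the deliberation logic.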

Persistent Audit Trail

Each run is persisted in SQLite. Proposals, critiques, tool actions, and final resolutions are stored as a durable record. You can trace exactly how a conclusion was formed.
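A durable audit record of this kind can be sketched with the standard-library `sqlite3` module. The table name and columns below are assumptions for illustration, not Council's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real run would persist to a file
conn.execute("""
    CREATE TABLE events (
        run_id  TEXT,
        stage   TEXT,   -- proposal | critique | tool | resolution
        actor   TEXT,
        content TEXT,
        ts      TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

def record(run_id, stage, actor, content):
    conn.execute(
        "INSERT INTO events (run_id, stage, actor, content) VALUES (?, ?, ?, ?)",
        (run_id, stage, actor, content),
    )

record("run-1", "proposal", "chatgpt", "Fix the failure modes first")
record("run-1", "critique", "chatgpt", "Gemini is overclaiming on feature flags")
record("run-1", "resolution", "council", "investigate")

# Trace exactly how the conclusion was formed, in order.
trail = conn.execute(
    "SELECT stage, actor FROM events WHERE run_id = ? ORDER BY rowid", ("run-1",)
).fetchall()
```

Every stage lands in the same ordered table, so reconstructing a deliberation is a single query per run.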

Sandboxed Operations

File access is handled through a specialized chair and secretary subsystem. Models do not receive ambient authority over the filesystem. Actions are requested, mediated, and recorded.
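The request-mediate-record flow could look like the following sketch. The `Chair`/`Secretary` split and the method names are assumptions based on the description above, not Council's actual implementation.

```python
class Secretary:
    """Records every mediated action for the audit trail."""
    def __init__(self):
        self.ledger = []

    def record(self, actor, action, verdict):
        self.ledger.append((actor, action, verdict))

class Chair:
    """Mediates file access: models request, the chair decides,
    the secretary records. No model touches the filesystem directly."""
    def __init__(self, secretary, allowed_paths):
        self.secretary = secretary
        self.allowed = allowed_paths

    def request_read(self, model, path):
        approved = any(path.startswith(p) for p in self.allowed)
        verdict = "approved" if approved else "denied"
        self.secretary.record(model, f"read {path}", verdict)
        if not approved:
            raise PermissionError(f"{model} may not read {path}")
        return f"<contents of {path}>"  # stand-in for the real read

sec = Secretary()
chair = Chair(sec, allowed_paths=["workspace/"])
chair.request_read("gemini", "workspace/notes.md")
```

Note that denied requests are still recorded: the audit trail captures what a model tried to do, not just what it was allowed to do.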

Workspace Integration

The system supports an interactive REPL and direct workspace integration, making it useful in real analysis workflows rather than isolated web interfaces.


Get started

Quick start

Install Council.

pip install council-engine

Configure your providers. Supports Gemini, ChatGPT, Claude, and Ollama.

council init

Invoke your first council.

council

Start with a question that benefits from real disagreement: a design decision, a tradeoff analysis, a risky refactor, a policy choice, or a research synthesis. Council works best when the cost of a shallow answer is high.