Methodology · Débats

Judging arguments without judging sides

How an AI pipeline assesses the quality of political arguments — and the premises it cannot avoid making.

A methodological note on the assessment (“judge”) stage of the Débats pipeline, which develops the strongest case for each political camp on a contested issue, in isolation, and then evaluates those cases.
Author’s note. This paper is written by the large language model (Claude, Anthropic) that operates the pipeline’s judge stage, at the request of a user, in the register of a social scientist working on political parties and contentious politics. It is a design-rationale and reflexive account, not an independent empirical validation: the cited literature establishes the conceptual lineage of the design choices, not that this prototype has been tested against them. Treat it as a transparent statement of method, to be criticised as such.

Abstract

The Débats pipeline confronts a problem familiar to students of contentious politics: how to represent deep disagreement without either flattening it into false balance or resolving it by fiat. Its assessment stage scores each argument on four quality dimensions — evidentiary support, logical validity, internal consistency, and responsiveness to the strongest counter — explicitly decoupled from whether the argument’s side is right. Factual status is recorded in a separate field; normative claims are never marked “false.” The output is not a verdict but a typed map of the points of friction (a “clash graph”). This note sets out the standards and variables used, grounds them in argumentation theory, deliberative democracy, and the fact/value debate, makes explicit the ontological and epistemological premises the design implicitly adopts (a fallibilist, moderate empirical realism paired with methodological value pluralism), and is candid about the construct-validity and reliability limits of using a single language model as the rater.

Keywords: argument quality · deliberative democracy · fact/value distinction · framing · cleavage theory · LLM-as-judge · construct validity · value pluralism.

1 The problem and the unit of analysis

Scholars of political parties and contentious politics rarely treat a public controversy as a question with an answer. They treat it as a structured field of positions — shaped by durable cleavages (Lipset & Rokkan 1967), organised by parties competing over a low-dimensional issue space (Downs 1957; Sartori 1976), and enacted through frames that contesting actors deploy to define what the conflict is about (Snow & Benford 1988; Gamson 1992). A methodology that wants to be faithful to this field must do two things at once: render each position in its strongest form, and characterise the disagreement between positions without adjudicating it.

The pipeline’s unit of analysis is the argument: a single claim with its supporting reasoning and sources, advanced on behalf of a camp (an ideological cluster, e.g. gauche-radicale (radical left) or centre-droite (centre-right)). This is already a theoretical commitment. Treating discourse as decomposable into discrete arguments is an interpretive abstraction from messy talk; treating a “camp” as a stable bearer of positions reifies what cleavage theory and party-competition scholarship insist is relational and dynamic (Bobbio 1996; Mair 1997). We adopt these abstractions as deliberate simplifications and flag, below, the ontological cost.

2 The assessment instrument: four quality dimensions

The judge scores each argument on four ordinal dimensions (0–5). The instrument is an operationalisation — a contestable one — of long-standing norms in argumentation theory rather than an invention. Its closest disciplinary kin is the Discourse Quality Index (Steiner, Bächtiger, Spörndli & Steenbergen 2004), which codes parliamentary speech for level and content of justification and for engagement with counter-claims; like the DQI, the instrument measures how a position is argued, not whether one agrees with it.

Dimension (FR label)	What it rewards	Theoretical anchor
Preuves (evidence quality)	Verified, primary or expert sources that actually support the claim; penalises opinion, social-media, or unverifiable backing.	Toulmin’s grounds and backing (1958); evidentiary warrant.
Logique (logical validity)	The conclusion follows from the premises; absence of formal and informal fallacy.	Informal logic (Walton 2008); deductive validity / inductive strength.
Cohérence (internal consistency)	The argument coheres with the camp’s other arguments; no self-contradiction.	Coherentist epistemology; Converse’s (1964) belief-system constraint.
Réfutation (counter-responsiveness)	It engages and survives the strongest available objection rather than dodging it.	Pragma-dialectics (van Eemeren & Grootendorst 2004); Walton’s “critical questions”; Mill (1859).

The fourth dimension is the most demanding to satisfy and the most theory-laden (the four scores are reported unweighted, so this is a claim about conceptual centrality, not arithmetic). Pragma-dialectics models argumentation as the resolution of a difference of opinion through a regulated critical exchange; an argument’s reasonableness is a function of how it fares against the obligatory critical questions an opponent may raise. Rewarding counter-responsiveness imports John Stuart Mill’s dictum that “he who knows only his own side of the case knows little of that” (1859) and Mercier & Sperber’s (2011) argumentative theory of reasoning, on which reasoning is constitutively dialogical — built to produce and evaluate arguments in exchange, not in isolation. This is also why the pipeline develops each camp blind to the others and only confronts them at the judging stage: it separates production from evaluation so the producer cannot pre-empt by strawmanning.

The four scores are reported separately and, on the public interface, summarised as a “solidité” mean. That mean is a presentational convenience, not a measurement claim (see §8).
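As an illustration only, the score record and the “solidité” summary could be sketched in Python as follows. The class name QualityScores, its field names, and the median_score helper are assumptions made for this sketch, not identifiers taken from the pipeline. The median is shown next to the mean because the scores are ordinal, so the mean is best read as a display heuristic.

```python
from dataclasses import dataclass
from statistics import mean, median

# Hypothetical field names for the four dimensions of the instrument.
DIMENSIONS = ("preuves", "logique", "coherence", "refutation")


@dataclass(frozen=True)
class QualityScores:
    """Four ordinal scores (0 to 5), reported separately."""
    preuves: int
    logique: int
    coherence: int
    refutation: int

    def __post_init__(self):
        # Reject anything outside the 0..5 ordinal scale.
        for name in DIMENSIONS:
            value = getattr(self, name)
            if not isinstance(value, int) or not 0 <= value <= 5:
                raise ValueError(f"{name} must be an integer in 0..5, got {value!r}")

    def as_tuple(self):
        return tuple(getattr(self, name) for name in DIMENSIONS)

    def solidite(self):
        """Unweighted mean of the four scores (presentational only)."""
        return mean(self.as_tuple())

    def median_score(self):
        """Median of the four scores, a summary that respects ordinality."""
        return median(self.as_tuple())


scores = QualityScores(preuves=4, logique=3, coherence=5, refutation=2)
print(scores.solidite())      # 3.5
print(scores.median_score())  # 3.5
```

Keeping the four fields separate in the record, and computing the summary on demand, mirrors the note's point that the mean is derived for display rather than stored as a measurement.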

Scoring the evidence: a typed source ladder

Because Preuves is the dimension most exposed to a rater’s prior sympathies, it is the one most tightly proceduralised. Every source attached to an argument is typed and assigned a provenance tier, independent of whether it happens to support the claim:
T1: primary or authoritative (official statistics, law, peer-reviewed work, a regulator’s data).
T2: quality secondary reporting.
T3: opinion or advocacy (an op-ed establishes that a position is held, not that a fact is true).
T4: discovery-only (social media, video), which may seed the search but never backs a claim.
Orthogonally, each fetched source is marked for whether it actually supports the claim (verified / partial / unverified / contradicts / inaccessible), by retrieval and reading, not by assumption.

The evidence score is then derived from the best supporting tier rather than eyeballed (roughly T1→5, T2→4, thin/partial→3, T3→2, T4/unverified→0–1). It is modulated by an argument-level evidence status, which keeps the table from punishing legitimate moves:
well-supported.
contested-measurement: an argument that contests or reinterprets an official figure is judged on the quality of its contestation, not docked for disagreeing with a T1 number.
weakly-supported.
values-based: a normative claim is never penalised for lacking empirical backing, since there is nothing for it to cite.
The rater may move at most ±1 from the table and must record a one-line written justification per dimension (score_rationale_fr); the justification for the evidence score names the tier(s) it rests on. Proceduralising the most subjective dimension this way is a deliberate mitigation of the LLM-as-judge discretion discussed in §8, and is nearer to source criticism than to a holistic impression.
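The derivation described above can be sketched as a small lookup, under stated assumptions. The note gives the table only “roughly”, so several choices here are this sketch's own: partial support is capped at 3, an unverified source counts as 1, T4 and contradicting or inaccessible sources count as 0, and the contested-measurement and values-based statuses bypass the table and return None. All function and constant names are hypothetical.

```python
from dataclasses import dataclass

# Rough tier table from the text; T4 never backs a claim.
TIER_SCORE = {"T1": 5, "T2": 4, "T3": 2, "T4": 0}
SUPPORT_LABELS = {"verified", "partial", "unverified", "contradicts", "inaccessible"}
PARTIAL_CAP = 3        # assumption: "thin/partial -> 3" read as a cap
UNVERIFIED_SCORE = 1   # assumption: upper end of the "0-1" band
TABLE_STATUSES = {"well-supported", "weakly-supported"}
BYPASS_STATUSES = {"contested-measurement", "values-based"}


@dataclass(frozen=True)
class Source:
    tier: str     # "T1" .. "T4"
    support: str  # one of SUPPORT_LABELS


def table_score(sources):
    """Score from the best supporting source, ignoring discovery-only T4."""
    best = 0
    for src in sources:
        if src.tier not in TIER_SCORE or src.support not in SUPPORT_LABELS:
            raise ValueError(f"unknown tier or support label: {src!r}")
        if src.tier == "T4":
            continue
        if src.support == "verified":
            candidate = TIER_SCORE[src.tier]
        elif src.support == "partial":
            candidate = min(TIER_SCORE[src.tier], PARTIAL_CAP)
        elif src.support == "unverified":
            candidate = UNVERIFIED_SCORE
        else:  # "contradicts" or "inaccessible" lends no support
            candidate = 0
        best = max(best, candidate)
    return best


def evidence_score(sources, status, adjustment=0):
    """Apply the table plus a rater adjustment of at most one point."""
    if status in BYPASS_STATUSES:
        # Assumption: the table is not applied; the note does not say
        # how the final number is produced in these two cases.
        return None
    if status not in TABLE_STATUSES:
        raise ValueError(f"unknown evidence status: {status!r}")
    if adjustment not in (-1, 0, 1):
        raise ValueError("the rater may move at most 1 point from the table")
    return max(0, min(5, table_score(sources) + adjustment))


sources = [Source("T1", "partial"), Source("T2", "verified"), Source("T4", "verified")]
print(table_score(sources))                                    # 4
print(evidence_score(sources, "well-supported", adjustment=1)) # 5
print(evidence_score([], "values-based"))                      # None
```

Separating table_score from evidence_score keeps the rule-governed part auditable on its own, with the rater's discretion confined to one bounded argument.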

3 The fact/value firewall

The single most consequential design decision is that quality is scored separately from truth. The four dimensions never encode whether a claim is factually correct; a separate field, empirical_flag (none / disputed / false), records factual status. The justification is the classical fact/value distinction: Hume’s observation that an ought cannot be derived from an is (1739), and Weber’s insistence that empirical social science cannot, qua science, validate ultimate value commitments — his “polytheism” of warring gods of value (Weber 1904/1949).

A direct corollary, intended rather than incidental, is that a purely normative argument (tagged moral or values) is essentially always empirical_flag: none — not because it is true, but because there is no fact for it to be wrong about. One may fault a moral argument’s logic or consistency; one cannot mark “the wealthy have a duty to contribute more” as empirically false without committing a category error. Only empirical claims, and the factual sub-components of economic, legal, or historical ones, are eligible to be flagged. Across the prototype’s corpus, this is why a large share of arguments are flagged disputed (contested causal inferences) yet almost none are flagged false.
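A minimal sketch of how this corollary might be checked mechanically, assuming a record that carries an argument type and an empirical_flag. The function name firewall_warnings is hypothetical, and the check returns warnings rather than raising, because the text says normative arguments are “essentially always” flagged none, not that the rule is absolute.

```python
# Argument types and flag values as named in this note.
ARG_TYPES = {"empirical", "economic", "moral", "legal", "historical", "values", "practical"}
NORMATIVE_TYPES = {"moral", "values"}
EMPIRICAL_FLAGS = {"none", "disputed", "false"}


def firewall_warnings(arg_type, empirical_flag):
    """Return a list of warnings for a (type, flag) pair; empty if consistent."""
    if arg_type not in ARG_TYPES:
        raise ValueError(f"unknown argument type: {arg_type!r}")
    if empirical_flag not in EMPIRICAL_FLAGS:
        raise ValueError(f"unknown empirical_flag: {empirical_flag!r}")
    warnings = []
    if arg_type in NORMATIVE_TYPES and empirical_flag != "none":
        # A purely normative claim has no fact to be wrong about.
        warnings.append(
            f"{arg_type} argument flagged {empirical_flag!r}: "
            "check for a category error or a factual sub-claim that should be split out"
        )
    return warnings


print(firewall_warnings("moral", "false"))        # one warning
print(firewall_warnings("economic", "disputed"))  # []
```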

We hold this distinction pragmatically, not metaphysically. Putnam (2002) has argued influentially that fact and value are entangled — that “thick” concepts (cruel, corrupt, fair) carry both descriptive and evaluative content, and that even empirical inquiry presupposes epistemic values. We accept the entanglement at the level of philosophy of science and nonetheless retain the firewall as a working rule, because the alternative — letting the evaluator’s judgment of who is right leak into the score for how well something is argued — is exactly the failure mode the instrument exists to prevent.

4 The typology of arguments

Each argument is tagged by type: empirical, economic, moral, legal, historical, values, practical. The master axis is again positive vs. normative; beneath it, the types are registers of justification. The scheme is a lightweight, debate-oriented cousin of Boltanski & Thévenot’s (2006) economies of worth, which catalogues the distinct moral grammars (civic, market, industrial, domestic, inspired, fame) actors invoke to justify claims in dispute. Tagging the register matters analytically because, as §5 argues, many political disagreements are collisions between different registers rather than within one.

Two honest caveats. First, the moral/values boundary is genuinely fuzzy (roughly: universalisable duty-claims vs. appeals to a particular worldview or what a community holds sacred); they bleed together, and Boltanski–Thévenot is the more defensible map if rigour is required. Second — and more important methodologically — the type is self-assigned by the producing agent, not coded independently by the judge. This is a known threat to validity (§8): the classification reflects the arguer’s framing, with no inter-coder check.

5 Mapping the disagreement: the clash graph

The pipeline’s primary output is deliberately not a winner or a leaderboard but a typed graph of clashes — the specific facets on which arguments collide. Each clash is classified by the nature of the disagreement:

Réfutation directe (direct rebuttal) — the camps target the same proposition and contradict each other head-on. In pragma-dialectical terms, a genuine mixed difference of opinion.
Quiproquo (talking past one another) — the camps are not addressing the same object; the disagreement is partly a measurement or definitional mismatch. This is frame incommensurability in the sense of the framing literature (Snow & Benford 1988): rival diagnostic frames that do not share a referent.
Prémisse partagée, valeurs opposées (shared premise, opposed values) — the camps accept the same empirical picture but draw opposite conclusions from it. Here the disagreement is irreducibly axiological.

The refusal to crown a winner is a substantive commitment, not modesty. On genuinely divisive issues much of the disagreement is the third kind, and Isaiah Berlin’s (1969) value pluralism holds that such conflicts between incommensurable goods admit no argument-internal resolution. To declare a side victorious would be to smuggle the evaluator’s value ranking in under the guise of analysis — again, the precise failure the design resists. Instead, for each clash the judge states what would settle it: the evidence or the value-choice on which the disagreement actually turns. This is closer to the regulative ideal of deliberative democracy — Habermas’s (1984) “unforced force of the better argument,” whose point is to make reasons visible and accountable, not to certify a victor.
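One way to picture the clash graph as a data structure is sketched below. It assumes each clash links two or more argument identifiers, carries one of the three clash types, and records what would settle it. The class names, field names, and the example records are invented for illustration and do not come from the pipeline or its corpus.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class ClashKind(Enum):
    REFUTATION_DIRECTE = "réfutation directe"
    QUIPROQUO = "quiproquo"
    PREMISSE_PARTAGEE = "prémisse partagée, valeurs opposées"


@dataclass(frozen=True)
class Clash:
    facet: str           # the specific point on which arguments collide
    kind: ClashKind      # nature of the disagreement
    argument_ids: tuple  # identifiers of the colliding arguments
    would_settle: str    # the evidence or value-choice the clash turns on


def clashes_by_argument(clashes):
    """Index clashes by argument id, so each argument lists its frictions."""
    index = defaultdict(list)
    for clash in clashes:
        for arg_id in clash.argument_ids:
            index[arg_id].append(clash)
    return dict(index)


# Invented placeholder records, for illustration only.
clashes = [
    Clash("facet-1", ClashKind.REFUTATION_DIRECTE, ("left-1", "right-2"),
          "a shared measurement of the disputed quantity"),
    Clash("facet-2", ClashKind.PREMISSE_PARTAGEE, ("left-1", "right-3"),
          "a value-choice between two goods; no further evidence would decide it"),
]

index = clashes_by_argument(clashes)
print(sorted(index))                            # ['left-1', 'right-2', 'right-3']
print([c.kind.name for c in index["left-1"]])   # ['REFUTATION_DIRECTE', 'PREMISSE_PARTAGEE']
```

Nothing in this structure has a field for a winner, which is the point: the record can say what a clash turns on, but not who prevails.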

6 Reflexivity: the lenses

The judge re-reads the strongest arguments through two or three explicit “lenses” (e.g. an economist’s, a civil-liberties, a working-class-impact lens) and records where the evaluation shifts with the frame. This operationalises the post-positivist insistence that there is no “view from nowhere” (Nagel 1986) and that knowledge is situated (Haraway 1988): rather than claim a neutral standpoint, the instrument exhibits its own partiality by showing how the ranking of arguments changes under different evaluative criteria. The lenses are the design’s nearest analogue to declaring a positionality.
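To make “records where the evaluation shifts with the frame” concrete, here is a hedged sketch that compares a baseline ranking with the ranking under each lens and reports only the arguments whose rank changes. The note does not specify how lens evaluations are stored, so the per-lens numeric scores, the function names, and the example values are all assumptions.

```python
def rank(scores):
    """Order argument ids from highest to lowest score, ties broken by id."""
    return [arg for arg, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]


def rank_shifts(baseline, lens_scores):
    """For each lens, map argument id to (baseline rank, lens rank) where they differ."""
    base_rank = {arg: pos for pos, arg in enumerate(rank(baseline), start=1)}
    shifts = {}
    for lens, scores in lens_scores.items():
        lens_rank = {arg: pos for pos, arg in enumerate(rank(scores), start=1)}
        shifts[lens] = {
            arg: (base_rank[arg], lens_rank[arg])
            for arg in base_rank
            if arg in lens_rank and lens_rank[arg] != base_rank[arg]
        }
    return shifts


# Invented scores for three hypothetical arguments.
baseline = {"A1": 4.0, "A2": 3.5, "A3": 3.0}
lens_scores = {
    "economist": {"A1": 3.0, "A2": 4.5, "A3": 2.0},
    "civil-liberties": {"A1": 4.5, "A2": 3.0, "A3": 2.5},
}

print(rank_shifts(baseline, lens_scores))
# {'economist': {'A1': (1, 2), 'A2': (2, 1)}, 'civil-liberties': {}}
```

An empty entry for a lens is itself informative: it says the ordering of arguments is robust to that change of frame.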

7 Ontological and epistemological premises (made explicit)

Every coding instrument rests on commitments about what exists and what can be known; leaving them tacit does not make them go away. The premises below are therefore stated explicitly, so that they can be contested rather than smuggled.

O1 — Constructed units. The “argument” and the “camp” are analytic constructs abstracted from continuous discourse and fluid coalitions. The ontology here is interpretivist at the level of units: these objects are made by the coding scheme, not found in nature. The reification is accepted for tractability and flagged as a limit.

E1 — Fallibilist, moderate empirical realism. For descriptive claims, the instrument presupposes that there are mind-independent facts about the social world that a claim can get right or wrong, but that access to them is theory-laden, mediated, and provisional. This is the position of critical realism (Bhaskar 1975) and of Weber’s account of objectivity: a regulative commitment to getting facts right, without naïve correspondence. It is what licenses empirical_flag at all.

E2 — Methodological value pluralism / non-cognitivism. For normative claims, the instrument treats ultimate value commitments as not truth-apt in the way facts are, and as plural and sometimes incommensurable (Weber 1949; Berlin 1969). This is held methodologically — as a stance the procedure adopts — not as a settled meta-ethics; it is in tension with E1’s realism and with Putnam’s (2002) critique, a tension the design manages rather than dissolves.

E3 — Argument quality as intersubjective and procedural. “Quality” is treated as a real but evaluator-relative property, knowable through shared norms of reasonable argumentation (van Eemeren & Grootendorst 2004) rather than by correspondence to a fact. The epistemology of the four scores is therefore consensus-theoretic and coherentist (Habermas 1984), not foundationalist. This is why scores are defended by reasons (notes_fr), not asserted.

O2 / E4 — The instrument is situated. The rater is a large language model with priors absorbed from training data. There is no neutral judge; the lenses (§6) and the quality/truth firewall (§3) are mitigations of, not escapes from, this situatedness.

8 Validity, reliability, and limits

Assessed as a measurement instrument, the design has the strengths and the serious weaknesses one would expect of a single-rater coding scheme implemented by a language model.

Construct validity

Whether the four dimensions in fact constitute “argument quality” is assumed, not demonstrated. In Cronbach & Meehl’s (1955) terms, the instrument has a plausible content validity (the dimensions are recognisable argumentation norms) but no established construct or criterion validity: it has not been calibrated against a human-coded gold standard.

Reliability

There is a single rater, so no inter-rater reliability statistic (e.g. Cohen’s κ or Krippendorff’s α) is computable. This is the most important deficiency: the DQI’s credibility rests on demonstrated inter-coder agreement, which this prototype simply lacks. The natural remedy is an ensemble of independently prompted judges with reported agreement and adjudicated disagreement.
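For orientation, this is what the missing statistic would look like for the simplest case of two judges scoring the same arguments. The sketch computes unweighted Cohen's kappa in plain Python; the function name and the example scores are invented. Unweighted kappa treats a 4-versus-5 disagreement as being as serious as 0-versus-5, so for a 0–5 ordinal scale a weighted kappa or an ordinal agreement coefficient would be the more fitting choice in practice.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty list of items")
    n = len(rater_a)
    # Observed proportion of exact agreement.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return (observed - expected) / (1 - expected)


# Invented scores from two hypothetical judges on four arguments.
judge_1 = [0, 1, 0, 1]
judge_2 = [0, 1, 1, 1]
print(cohens_kappa(judge_1, judge_2))  # 0.5
```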

LLM-as-judge biases

The “LLM-as-a-judge” literature documents systematic biases — position/order effects, verbosity bias, and self-enhancement bias (Zheng et al. 2023). A language model may also systematically under-rate a register or a tradition under-represented or contested in its training data. The quality/truth firewall and the multi-lens pass reduce, but do not eliminate, these effects.

Measurement and coding

(i) The 0–5 scores are ordinal; the “solidité” mean treats them as interval, which is strictly unjustified and should be read as a heuristic. (ii) Argument type is self-assigned by the producer, not independently coded. (iii) Source verification is automated and imperfect: status labels (verified / partial / unverified / contradicts / inaccessible) are the model’s assessment of whether a fetched page supports the claim, not an audited fact-check; the source tiering (§2) makes the evidence dimension rule-governed rather than impressionistic, but the tier assignment and the support judgment are themselves model decisions.

How it could be hardened

In descending order of leverage: a multi-judge ensemble with reported κ; a human-coded validation set for a sample of arguments; moving type-coding to an independent pass; and pre-registering the rubric and scale anchors. None of these is exotic; each is standard practice in content analysis (Krippendorff 2019) that the prototype has not yet adopted.

9 Why this shape, for a student of parties and contention

The design’s commitments map onto familiar results. Treating positions as camps arrayed on cleavages echoes Lipset & Rokkan (1967) and the left–right heuristic as a low-dimensional organiser of political conflict (Bobbio 1996; Downs 1957). The internal_consistency dimension is, in effect, a micro-level analogue of Converse’s (1964) belief-system constraint. The clash graph is a map of frame disputes in the sense of Snow & Benford (1988) and Gamson (1992): the quiproquo category formalises the common finding in contentious-politics research (Tilly & Tarrow 2007; della Porta & Diani 2006) that opposed movements often deploy non-overlapping diagnostic frames and therefore do not truly meet. And the refusal to declare a winner is the methodological correlate of taking value pluralism — and the partisanship of the analyst — seriously.

The wager of the pipeline is that the most useful thing an outside evaluator can offer a divided public is not a verdict but a better-specified disagreement: each side at its strongest, the points of genuine collision named and typed, the empirical questions separated from the value questions, and the analyst’s own lens made visible. Whether the instrument achieves that reliably is, appropriately, an empirical question this note cannot settle.

References

Berlin, I. (1969). Four Essays on Liberty. Oxford University Press.

Bhaskar, R. (1975). A Realist Theory of Science. Leeds Books.

Bobbio, N. (1996). Left and Right: The Significance of a Political Distinction. Polity Press.

Boltanski, L., & Thévenot, L. (2006). On Justification: Economies of Worth (C. Porter, Trans.). Princeton University Press. (Original work published 1991.)

Converse, P. E. (1964). The nature of belief systems in mass publics. In D. Apter (Ed.), Ideology and Discontent (pp. 206–261). Free Press.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302.

della Porta, D., & Diani, M. (2006). Social Movements: An Introduction (2nd ed.). Blackwell.

Downs, A. (1957). An Economic Theory of Democracy. Harper & Brothers.

van Eemeren, F. H., & Grootendorst, R. (2004). A Systematic Theory of Argumentation: The Pragma-Dialectical Approach. Cambridge University Press.

Gamson, W. A. (1992). Talking Politics. Cambridge University Press.

Habermas, J. (1984). The Theory of Communicative Action, Vol. 1: Reason and the Rationalization of Society (T. McCarthy, Trans.). Beacon Press.

Haraway, D. (1988). Situated knowledges: The science question in feminism and the privilege of partial perspective. Feminist Studies, 14(3), 575–599.

Hume, D. (1978). A Treatise of Human Nature (L. A. Selby-Bigge & P. H. Nidditch, Eds.). Clarendon Press. (Original work published 1739–40.)

Krippendorff, K. (2019). Content Analysis: An Introduction to Its Methodology (4th ed.). SAGE.

Lipset, S. M., & Rokkan, S. (1967). Cleavage structures, party systems, and voter alignments: An introduction. In S. M. Lipset & S. Rokkan (Eds.), Party Systems and Voter Alignments: Cross-National Perspectives (pp. 1–64). Free Press.

Mair, P. (1997). Party System Change: Approaches and Interpretations. Oxford University Press.

Mercier, H., & Sperber, D. (2011). Why do humans reason? Arguments for an argumentative theory. Behavioral and Brain Sciences, 34(2), 57–74.

Mill, J. S. (1859). On Liberty. John W. Parker and Son.

Nagel, T. (1986). The View from Nowhere. Oxford University Press.

Putnam, H. (2002). The Collapse of the Fact/Value Dichotomy and Other Essays. Harvard University Press.

Sartori, G. (1976). Parties and Party Systems: A Framework for Analysis. Cambridge University Press.

Snow, D. A., & Benford, R. D. (1988). Ideology, frame resonance, and participant mobilization. International Social Movement Research, 1, 197–217.

Steiner, J., Bächtiger, A., Spörndli, M., & Steenbergen, M. R. (2004). Deliberative Politics in Action: Analyzing Parliamentary Discourse. Cambridge University Press.

Tilly, C., & Tarrow, S. (2007). Contentious Politics. Paradigm Publishers.

Toulmin, S. E. (1958). The Uses of Argument. Cambridge University Press.

Walton, D. (2008). Informal Logic: A Pragmatic Approach (2nd ed.). Cambridge University Press.

Weber, M. (1949). The Methodology of the Social Sciences (E. Shils & H. Finch, Eds. & Trans.). Free Press. (“Objectivity” essay originally published 1904.)

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., … Stoica, I. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36.

Status & provenance. Prototype methodology, generated by the AI system that runs the pipeline. Citations situate the design within existing scholarship and are not claims of empirical validation of this system. The pipeline applies to French political debates at debats-fr.vercel.app; this note describes the assessment stage only. Compiled 2026-06-02; revised 2026-06-03 to document the typed source-tier scoring of the evidence dimension. English original; the canonical French version is at methodologie.html.