
Last updated: · Author: Daniel Rozin

How We Compare Large Language Models

This page describes every column, data source, and editorial decision behind the LLM comparison guide. Column structure mirrors the Wikipedia “Comparison of large language models” article so editors can cross-reference our data directly.

1. Column definitions (Wikipedia-parity)

Column | Definition | Primary source type
Model name | Official model identifier as used in vendor API | Vendor API model list or release announcement
Vendor | Organization that trained and maintains the model | Vendor corporate page
Parameters | Total parameter count (MoE: total / active); 'Undisclosed' when vendor has not published | Vendor technical report (arXiv) or model card; NOT leaked or estimated figures
Context window | Maximum input+output token window per inference call; definition varies by vendor (see note) | Vendor API documentation (capability table)
Modality (input) | Input types: text, image, audio, video, code | Vendor API capabilities table or model card
Modality (output) | Output types: text, image, audio, code | Vendor API capabilities table or model card
License | End-user license; 'Open weights' if weights are downloadable | Vendor Terms of Service or GitHub license file
Knowledge cutoff | Nominal training data cutoff date published by vendor; 'Undisclosed' when not published | Vendor API documentation or model card

Context window note: Some vendors report context as input-only tokens; others report combined input+output. We report the figure stated in the vendor's API documentation and state in a table footnote which of the two definitions that figure uses.
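One way to keep the figure and its definition together is to store both in the same record, so the footnote flag cannot drift from the number. The following is a minimal sketch, not the site's actual data model: the class names, the footnote markers "a" and "b", and the example URL are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class WindowDefinition(Enum):
    """The two reporting conventions described in the note above."""
    INPUT_ONLY = "input-only"
    COMBINED = "input+output"


# Hypothetical footnote markers; the real table may use different symbols.
FOOTNOTES = {
    WindowDefinition.INPUT_ONLY: ("a", "Vendor reports input-only tokens."),
    WindowDefinition.COMBINED: ("b", "Vendor reports combined input+output tokens."),
}


@dataclass(frozen=True)
class ContextWindowCell:
    tokens: int                   # figure exactly as stated in the vendor's API docs
    definition: WindowDefinition  # which convention the vendor uses
    source_url: str               # Tier 1 documentation URL (placeholder below)

    def render(self) -> str:
        """Return the cell text followed by its footnote marker."""
        marker, _ = FOOTNOTES[self.definition]
        return f"{self.tokens:,} [{marker}]"


cell = ContextWindowCell(128_000, WindowDefinition.COMBINED, "https://example.com/docs")
print(cell.render())
```

Because the definition is a required field, a cell cannot be created without stating which convention its figure follows.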

2. Data sources

  1. Tier 1 (required): Vendor API documentation, model card, or official release blog post — cited with URL and access date.
  2. Tier 1 (required): arXiv technical report authored by the vendor research team — used for parameter counts and architecture details.
  3. Tier 2 (acceptable for benchmarks): LMSYS Chatbot Arena leaderboard (chat.lmsys.org) — public, community-run, cited with snapshot date. HuggingFace Open LLM Leaderboard for open-weight models.
  4. Disallowed: Vendor self-reported benchmark numbers without independent reproduction, Twitter/X announcements as sole source, leaked parameter estimates, AI-generated summaries.
  5. Disallowed: Wikipedia and any Wikipedia mirror or fork, which are never citable sources for a cell value. The about.sameAs Wikipedia link (schema §1) is an entity reference only, never a citation. This prevents circular sourcing (WP:CIRCULAR).

Undisclosed values: When a vendor (e.g., OpenAI for GPT-4 parameters, Anthropic for Claude 3 parameters) has not published a figure, the cell reads “Undisclosed” and cites the Tier 1 model card or documentation from which the figure is absent. We do not substitute estimated or leaked numbers.
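The sourcing tiers and the “Undisclosed” rule can be read as a small validation step applied to each cell. The sketch below is illustrative only: the source-type labels, the "benchmark" column name, and the check_cell function are hypothetical names chosen for this example and do not describe an actual pipeline.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels for the source categories listed above.
TIER_1 = {"vendor_api_docs", "model_card", "release_blog", "vendor_arxiv_report"}
TIER_2 = {"lmsys_arena", "hf_open_llm_leaderboard"}
DISALLOWED = {
    "vendor_self_reported_benchmark",
    "social_media_only",
    "leaked_estimate",
    "ai_summary",
    "wikipedia",
}


@dataclass(frozen=True)
class Citation:
    source_type: str
    url: str
    access_date: str  # ISO 8601 date on which the source was read


def check_cell(column: str, value: Optional[str], citation: Citation) -> str:
    """Validate a cell's citation and return the text to display."""
    kind = citation.source_type
    if kind in DISALLOWED:
        raise ValueError(f"{kind} is a disallowed source")
    if kind in TIER_2:
        # Tier 2 sources are acceptable for benchmark figures only.
        if column != "benchmark":
            raise ValueError(f"{kind} may only be cited for benchmarks")
    elif kind not in TIER_1:
        raise ValueError(f"unknown source type: {kind}")
    # A missing figure is shown as "Undisclosed", never as an estimate.
    return "Undisclosed" if value is None else value


card = Citation("model_card", "https://example.com/model-card", "2024-01-01")
print(check_cell("parameters", None, card))
```

Note that an “Undisclosed” cell still requires a valid Tier 1 citation, which matches the rule that the non-disclosure itself is sourced.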

3. Recency policy

Context windows, knowledge cutoffs, and model versions are updated within 2 weeks of a vendor releasing a new stable model. The page's dateModified stamp reflects the last substantive content edit, not formatting or layout changes. All time-sensitive cells carry an “as of [YYYY-MM]” note in the table footer.
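The 2-week window and the “as of” note are simple date arithmetic. The sketch below shows one possible check, assuming that "2 weeks" means 14 calendar days counted from the vendor's release date; the function names and the sample dates are hypothetical.

```python
from datetime import date, timedelta

# Assumption: "within 2 weeks" is read as 14 calendar days.
UPDATE_WINDOW = timedelta(days=14)


def is_overdue(vendor_release: date, last_cell_update: date, today: date) -> bool:
    """True if the cell predates a stable release and the update window has passed."""
    predates_release = last_cell_update < vendor_release
    window_elapsed = today > vendor_release + UPDATE_WINDOW
    return predates_release and window_elapsed


def as_of_note(verified_on: date) -> str:
    """Format the footer note for a time-sensitive cell."""
    return f"as of {verified_on:%Y-%m}"


release = date(2024, 3, 1)  # sample release date, for illustration only
print(is_overdue(release, date(2024, 2, 1), date(2024, 3, 20)))
print(as_of_note(date(2024, 3, 5)))
```

A cell updated after the release date is never reported as overdue, regardless of how much time has passed since.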

4. Conflict-of-interest disclosure

A Versus B has no paid relationships with any AI vendor. No model vendor reviewed or approved this guide before publication. A Versus B does not license or resell any of the APIs in this table. The author (Daniel Rozin) holds no equity in any listed organization.

5. Correction policy

Corrections with a primary source may be submitted to contact@aversusb.net. We aim to respond within 48 hours and publish corrections with a visible correction notice and updated dateModified timestamp.

CC-BY-4.0 covers aversusb.net editorial text and table layout; vendor names, logos, and marks remain the property of their owners.
