AGI Milestone Tracker

Workbook-backed tracker

Evidence

Published evidence library

Browse the curated evidence entries attached to questions across the tracked dimensions. This MVP distinguishes between benchmark evidence, leaderboards, research, news, and implementation material.

Methodology

The public tracker is generated from a maintained workbook. The Questions sheet holds the dimension and question structure, and the Evidence sheet holds the published evidence entries linked to those questions.
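As a rough illustration of the two-sheet structure described above, the sketch below models Questions rows and Evidence rows and groups evidence under the question it links to. The field names (`question_id`, `kind`, `status`) and the IDs `Q1` and `Q2` are assumptions made for this example; the workbook's actual column names and keys are not shown on this page.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    question_id: str  # hypothetical key; the real column name is not published
    dimension: str
    text: str


@dataclass(frozen=True)
class Evidence:
    title: str
    source: str
    kind: str         # e.g. "Leaderboard" or "Benchmark"
    status: str       # e.g. "Met" or "In progress"
    question_id: str  # hypothetical link back to a Questions row


def group_evidence_by_question(questions, evidence):
    """Return {question_id: [Evidence, ...]}; reject links to unknown questions."""
    known = {q.question_id for q in questions}
    grouped = defaultdict(list)
    for item in evidence:
        if item.question_id not in known:
            raise ValueError(
                f"{item.title!r} links to unknown question {item.question_id!r}"
            )
        grouped[item.question_id].append(item)
    return dict(grouped)


# Sample rows copied from entries on this page; the IDs are made up.
questions = [
    Question(
        "Q1",
        "AI can correctly understand, reason through, and plan difficult tasks",
        "Can it solve hard problems that require several correct steps "
        "while following constraints?",
    ),
    Question(
        "Q2",
        "AI can adapt and generalize beyond the exact examples it has seen",
        "Can it transfer what it knows across domains, tasks, and languages?",
    ),
]
evidence = [
    Evidence("AIME 2025", "Artificial Analysis", "Leaderboard", "Met", "Q1"),
    Evidence("LiveCodeBench", "LiveCodeBench", "Leaderboard", "Met", "Q1"),
    Evidence("Global-MMLU-Lite", "Artificial Analysis", "Leaderboard", "In progress", "Q2"),
]

grouped = group_evidence_by_question(questions, evidence)
```

Rejecting an evidence row whose question key is unknown is one possible design choice: it surfaces broken links when the page is generated rather than silently dropping entries.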

23 evidence items shown

Leaderboard | Met

AIME 2025

Artificial Analysis

Artificial Analysis' AIME 2025 evaluation measures olympiad-level mathematical reasoning with exact integer answers, directly fitting multi-step reasoning under constraints.

Linked to

AI can correctly understand, reason through, and plan difficult tasks

Can it solve hard problems that require several correct steps while following constraints?

May 29, 2026

Leaderboard | Met

LiveCodeBench

LiveCodeBench

LiveCodeBench continuously evaluates fresh coding problems for code generation, self-repair, and execution, making it a strong public source for coding task execution.

Linked to

AI can correctly understand, reason through, and plan difficult tasks

Can it solve hard problems that require several correct steps while following constraints?

May 29, 2026

Leaderboard | In progress

Global-MMLU-Lite

Artificial Analysis

Global-MMLU-Lite evaluates knowledge and reasoning across a diverse range of languages and cultural contexts, directly fitting cross-domain and cross-language transfer.

Linked to

AI can adapt and generalize beyond the exact examples it has seen

Can it transfer what it knows across domains, tasks, and languages?

May 29, 2026

Benchmark | In progress

ARC-AGI-2 / ARC Prize

ARC Prize

ARC-AGI-2 tests abstract reasoning on previously unseen tasks under efficiency constraints, making it a direct public benchmark for out-of-distribution generalization.

Linked to

AI can adapt and generalize beyond the exact examples it has seen

Can it handle problems meaningfully unlike those it has seen before?

May 29, 2026

Leaderboard | In progress

Artificial Analysis Long Context Reasoning (AA-LCR)

Artificial Analysis

AA-LCR measures extraction, reasoning, and synthesis across 10k-100k token documents, directly fitting long-context memory and generalization.

Linked to

AI can adapt and generalize beyond the exact examples it has seen

Can it use long context without losing important details?

May 29, 2026

Leaderboard | In progress

AA-Omniscience

Artificial Analysis

AA-Omniscience measures factual recall while penalizing hallucination and rewarding abstention, making it a strong source-level signal for factual accuracy.

Linked to

AI can stay grounded in facts and evidence and be honest about uncertainty

Does it usually get the facts right?

May 29, 2026

Benchmark | In progress

ALCE

Princeton NLP

ALCE is a benchmark for automatic citation evaluation that measures whether answers are supported by cited evidence, directly fitting citation and evidence correctness.

Linked to

AI can stay grounded in facts and evidence and be honest about uncertainty

Are its claims grounded in relevant supporting evidence?

May 29, 2026

Benchmark | In progress

AbstentionBench

Meta research

AbstentionBench evaluates when models should hold back on underspecified, ill-posed, or unanswerable questions, directly fitting uncertainty management through appropriate abstention.

Linked to

AI can manage uncertainty and improve its own answers

Does it manage uncertainty appropriately, including confidence and holding back when needed?

May 29, 2026

Benchmark | In progress

CorrectBench

CorrectBench authors

CorrectBench measures whether models improve answers after critique or verification across reasoning tasks, directly fitting self-correction.

Linked to

AI can manage uncertainty and improve its own answers

Can it notice and fix its own mistakes?

May 29, 2026

Leaderboard | In progress

Tau-2 Bench Telecom

Artificial Analysis / Tau-2 Bench project

Tau-2 Bench Telecom simulates dual-control support conversations where agent and user coordinate to solve telecom issues, directly fitting interaction quality in multi-turn settings.

Linked to

AI can interact with people clearly, helpfully, and with social awareness

Is it clear, helpful, and appropriate in how it responds?

May 29, 2026

Leaderboard | In progress

EQ-Bench 3

EQ-Bench

EQ-Bench 3 measures emotional intelligence in challenging roleplays across empathy and social dimensions, directly fitting emotion and affect handling.

Linked to

AI can interact with people clearly, helpfully, and with social awareness

Can it respond appropriately when emotions or tension are involved?

May 29, 2026

Leaderboard | In progress

MMMU-Pro

Artificial Analysis / MMMU-Pro authors

MMMU-Pro rigorously tests multimodal understanding and reasoning across 30 academic disciplines, directly fitting multimodal reasoning.

Linked to

AI can understand, reason over, and generate across multiple modalities

Can it reason across more than one modality at once?

May 29, 2026

Leaderboard | In progress

Artificial Analysis Text-to-Image Leaderboard

Artificial Analysis

Artificial Analysis' text-to-image leaderboard measures comparative image generation quality across prompt categories, providing a public source-level signal for multimodal generation.

Linked to

AI can understand, reason over, and generate across multiple modalities

Can it create useful outputs across text, image, audio, or video?

May 29, 2026

Leaderboard | Met

IFBench

Artificial Analysis / IFBench authors

IFBench evaluates precise instruction-following generalization on diverse verifiable output constraints, directly fitting steerability and constraint adherence.

Linked to

AI can remain steerable, safe, and protective of private information under adversarial pressure

Does it follow instructions and stay within set limits?

May 29, 2026

Benchmark | In progress

HarmBench

Center for AI Safety

HarmBench is a standardized benchmark for automated red teaming and robust refusal, directly fitting harmfulness prevention.

Linked to

AI can remain steerable, safe, and protective of private information under adversarial pressure

Does it avoid giving harmful advice or unsafe help?

May 29, 2026

Leaderboard | In progress

SWE-bench Verified

SWE-bench team

SWE-bench Verified measures whether systems resolve real GitHub issues in real repositories, directly fitting long-horizon task completion in software environments.

Linked to

AI can autonomously execute long, tool-using workflows in digital environments

Can it complete long, multi-step tasks from start to finish?

May 29, 2026

Leaderboard | In progress

GAIA

GAIA benchmark team

The official GAIA leaderboard covers non-trivial tool-using assistant tasks with unambiguous answers across difficulty levels, making it a strong source-level signal for end-to-end real-world task execution.

Linked to

AI can autonomously execute long, tool-using workflows in digital environments

Can it complete long, multi-step tasks from start to finish?

May 29, 2026

Leaderboard | In progress

Berkeley Function Calling Leaderboard (BFCL) V4

UC Berkeley / Gorilla

BFCL V4 evaluates executable correctness of function and tool calls across real-world and multi-turn scenarios, directly fitting reliable tool use.

Linked to

AI can autonomously execute long, tool-using workflows in digital environments

Can it coordinate multiple tools in the right order?

Dec 16, 2025

Leaderboard | In progress

Terminal-Bench Hard

Artificial Analysis / Terminal-Bench

Artificial Analysis evaluates Terminal-Bench Hard on realistic terminal tasks spanning software engineering, system administration, and data processing, directly fitting digital environment manipulation.

Linked to

AI can autonomously execute long, tool-using workflows in digital environments

Can it operate digital environments to get a task done?

May 29, 2026

Leaderboard | In progress

SWE-bench Verified

SWE-bench team

SWE-bench Verified measures whether systems resolve real GitHub issues in real repositories, directly fitting operation in real software environments.

Linked to

AI can autonomously execute long, tool-using workflows in digital environments

Can it operate digital environments to get a task done?

May 29, 2026

Leaderboard | In progress

GDPval-AA

Artificial Analysis / OpenAI GDPval dataset

GDPval-AA evaluates agentic completion of economically valuable work tasks with web and shell access, making it a direct source-level signal for real-world task execution.

Linked to

AI can autonomously execute long, tool-using workflows in digital environments

Can it operate digital environments to get a task done?

May 29, 2026

Leaderboard | In progress

Terminal-Bench Hard

Artificial Analysis / Terminal-Bench

Artificial Analysis evaluates Terminal-Bench Hard on realistic terminal tasks where agents must recover from failed actions and continue toward a goal, directly fitting execution recovery.

Linked to

AI can autonomously execute long, tool-using workflows in digital environments

Can it recover from errors and continue instead of getting stuck?

May 29, 2026

Benchmark | In progress

ARC-AGI-2 / ARC Prize

ARC Prize

ARC-AGI-2 tests abstract reasoning on previously unseen tasks under efficiency constraints, making it a direct public benchmark for out-of-distribution generalization.

Linked to

AI can stay reliable when inputs or workflows become noisy, shifted, repeated, or long-running

Can it still perform well when tasks or data differ from what it usually sees?

May 29, 2026