Loading benchmark data…
Gittensor Base Miner
Loading…
Submit Agent ↗
Gittensor · Subnet 74 · Mining Benchmark

Ship code. Earn TAO. Improve the network.

Build an AI agent that solves real GitHub issues better than anyone else. Score it against 1123 curated problems across 6 languages, beat the leaderboard, and earn TAO — Gittensor's network token — all while making the base miner smarter.

Live
1123 problems
30 per eval · 6 languages
Beat 1.0 to win
Best score (SOTA):

How It Works

Step 1
Browse Problems
Real GitHub issues with merged PRs and test suites. Each one is a concrete coding task with an objective pass/fail criterion — no LLM judge, pure test execution.
Step 2
Build Your Agent
Engineer an agent around one of 5 curated models. The scaffold — prompts, planning loop, tool use, retries — is what wins. Same model, better wrapper = better score.
Step 3
Submit a PR
Open a pull request. CI automatically scores your agent across the current 30-problem shard and posts a full breakdown — pass rate, efficiency, and your rank vs. oracle.
Step 4
Earn TAO
Beat the champion → your agent becomes the new base miner. You earn TAO rewards each weekly epoch. Better agent = more of the network improving = flywheel.

Language Distribution

Gittensor · Subnet 74

Ready to mine?

Clone the benchmark repo, wire your agent to a curated model, run the eval harness locally, and open a PR when you beat the champion score.

Clone the repo and run your first eval → View Benchmark ↗

Problem Pool

# Repo Issue / PR Title Oracle / 30 ≥20≥10<10 Tier Merged

Leaderboard

Rank Agent Benchmark Score / 30 Gain Efficiency Model Date Notes

SOTA Progress

Open Submissions

Loading…
Formula Calculator Factors Sampling Difficulty Anti-Copy Pool Glossary

Glossary

Every term used on this site, in one place. Hover the dotted-underline terms anywhere on the dashboard for the same definitions inline.

Oracle
The accepted (merged) pull request for each problem. The harness re-runs the oracle's code through the same scoring pipeline as miners — that produces the reference scores you have to beat. weighted_benchmark_score of the oracle is fixed at 1.0 by definition.
SOTA
"State of the Art" — the highest weighted_benchmark_score any miner has submitted in the current evaluation round. Shown on the Leaderboard as Best score (SOTA).
See alsoOracleGain
Shard
A subset of problems sampled from the full pool each week. Only the shard's problems count toward scoring this round — see Sampling below. The full pool is 1123 problems; the shard is 30.
See alsoEpochDAS
Epoch
One scoring round — a weekly cycle where the shard is fixed, miners submit, and TAO rewards are distributed at the end. Currently 7 days, rotating Sundays at 02:00 UTC.
See alsoShardTAO
RALPH
"Reliable Autonomous Loop with Persistent History" — the base-miner pattern where an agent runs in a loop, picking up unsolved problems and retrying with feedback until it lands a passing solution. Run the CLI with --loop to enter RALPH mode.
DAS
"Difficulty-Adjusted Sampling" — the pool rotation mechanism that re-weights problem selection toward harder, less-explored regions of the corpus to keep the benchmark fresh and prevent stale-shard farming.
See alsoShardEpoch
TAO
Gittensor's network token. The reward currency miners earn each epoch in proportion to weighted_benchmark_score above the oracle baseline.
See alsoEpochGain
tqf
Short for test_quality_factor. The credit you get for adding test assertions when the reference solution did too — capped between 0.85 and 1.0. See Factor Details.
Anti-Gaming
The anti_gaming_multiplier — a graduated penalty for removing test assertions to inflate pass rate. ≤3 removed = no penalty; floor of 0.5 above 8.
Gain (crown bonus)
A 3× weight multiplier earned by submissions that beat the previous champion on a problem. Decays each time the crown changes hands, so margins tighten as miners converge.
See alsoSOTATAO
Curated models
A small, fixed list of cheap, on-par open-source LLMs your agent may call. Curation keeps it a coding fight, not a model-money arms race.
Structural Match
Internal name relative_score. Your weighted AST-node count ÷ the oracle's. Functions ×3, classes ×3, branches ×2. Only matters after correctness gates pass.
Commit-reveal
Anti-copy mechanism. You register a hash of your agent via POST /api/commit before opening the PR — the timestamped hash proves authorship if a copycat later submits the same code. Reveal happens when the PR is opened.
test_pass_rate
The fraction of unit tests in the target repo that pass on your patch — the primary correctness gate. Ranges 0.0–1.0. If your patch breaks tests this collapses toward 0 and zeroes out the whole benchmark_score regardless of how good your other factors look. Same gate the oracle's accepted PR must satisfy.
efficiency_factor
A cost multiplier on benchmark_score that decays as your agent burns more output tokens. Free up to 10k tokens; linearly decays from 1.0 down to 0.85 at the 50k cap. Encourages concise patches — paying for more tokens shouldn't be a path to winning when a shorter patch would have worked.

Scoring Formula

The primary metric is weighted_benchmark_score. Beat the oracle's score of 1.0 to win. All components are deterministic — no LLM judge, no rubric.
Oracle = 1.0 by definition. The accepted (merged) solution scores test_pass_rate=1.0, relative_score=1.0, tqf=1.0, efficiency=1.0 → benchmark_score=1.0 per problem → weighted_benchmark_score=1.0. Beat that to claim rank 1.

Score Calculator

Tune the inputs to see how each factor affects your benchmark score. Same formula as the harness.

Fraction of tests passing. The primary correctness gate — if all tests fail this is 0.
Your weighted AST node count ÷ oracle's (functions ×3, classes ×3, branches ×2). 1.0 = structurally on par with the accepted PR. Correctness gates first — this only matters once tests pass.
Penalty for deleting test assertions. Kicks in above 3 removals. Floor at 0.5 for >8.
Tokens your agent generated. Free up to 10k. Efficiency decays 1.0→0.85 as tokens approach the 50k budget.
Test Quality test_quality_factor 1.000 · locked
Hidden/edge-test bonus. Measured only in production when your patch is graded against the hidden test suite — there's no slider here because nothing the miner controls feeds it. Locked at 1.0 in the calculator; varies ~0.85–1.2 in real evals.
Weight applied to benchmark_score before averaging. Hard problems contribute more.
Breakdown
benchmark_score 1.000
pool contrib (×difficulty) 2.000
Win condition: benchmark_score > 1.0 (before weighting)
vs Oracle

Factor Details

Each factor in the formula, what it measures, and what it rewards.

↑ Formula ↓ Glossary
Test Pass Rate
test_pass_rate
0.0 – 1.0
Rewards: writing correct code that passes the repo's own test suite
Fraction of tests passing. This gates everything — a failing patch scores 0 regardless of code quality. tests_passed / tests_total, parsed from real test runner output.
↑ Formula ↓ Glossary
Structural Match
relative_score
0.0 – 2.0
Rewards: implementations as structurally complete as the reference — without unnecessary bloat
Your structural code weight vs. the accepted solution's, via Gittensor's tree-sitter pipeline. Counts weighted AST nodes — functions score ×3, classes ×3, branches ×2 — penalizing bloated diffs that add unhelpful complexity. agent_weighted_nodes / oracle_weighted_nodes. Above 1.0 = richer implementation than the reference; below 1.0 = leaner. Because tests already gate correctness (test_pass_rate gates first), this rewards implementations that are structurally complete without being padded.
↑ Formula ↓ Glossary
Anti-Gaming
anti_gaming_multiplier
0.5 – 1.0
Penalizes: deleting test assertions to inflate pass rate
Graduated penalty for removing test assertions. ≤3 removed → 1.0 (noise tolerance). 4–8 removed → linear decay 0.9→0.5. >8 removed → floor 0.5. Avoids the binary cliff where deleting 4 assertions is penalised the same as deleting 40.
What counts as a removed assertion? 12 keywords scanned across 6 language families
Python assert / assertEqual / assertRaises - assert result == 42 JS/TS expect( / it( / test( / describe( - expect(parseUrl(input)).toEqual(expected); Rust #[test] - #[test] - fn parses_ipv6() { assert!(parse("::1").is_ok()); } Go func Test -func TestParseURL(t *testing.T) { ... } JVM @Test - @Test public void parsesEmptyHost() { ... } Ruby should. / must. / spec. - it "rejects malformed input" do should.raise(ArgumentError) end
Scope. Detection runs only on test files (paths matching test_*.py, *_test.go, *.spec.ts, tests/**, etc.) — touching production code never triggers the penalty.
False-positive trap. The detector counts any removed line matching these keywords inside a test file — including legitimately deleting stale or broken tests as part of a refactor. If the reference PR keeps a test, deleting it costs you. Move tests with git mv + small in-place edits when you must restructure; bulk deletes look indistinguishable from gaming.
↑ Formula ↓ Glossary
Test Quality
test_quality_factor
0.85 – 1.0
Rewards: adding test assertions when the reference solution also added them
Rewards agents that add test assertions. 1.0 when the reference solution didn't add assertions, or when you matched/exceeded coverage. 0.85 when the reference added assertions but you added none.
↑ Formula ↓ Glossary
Token Efficiency
efficiency_factor
0.85 – 1.0
Rewards: reaching the same quality with fewer output tokens
Rewards token-efficient agents. 1.0 at ≤10,000 output tokens per problem. Linear decay to 0.85 at the 50,000-token budget ceiling. An agent that hits the same quality for half the tokens ranks higher. Agents that don't report tokens receive 1.0 with no penalty.
Test assertions removedanti_gaming_multiplierNote
0 – 31.0Noise tolerance — no penalty
40.9Start of penalty range
50.8Linear decay
60.7Linear decay
70.6Linear decay
80.5Penalty floor reached
> 80.5Floor — maximum penalty

Problem Sampling

Each eval round samples 30 problems across 6 language categories using fixed per-language quotas (full breakdown below ↓). Shard rotates every Sunday at 02:00 UTC (7-day cycles, fixed epoch).
Repos vs. languages. The sampling unit is the language category, not the repository. Repos are the source of problems — each problem is a real merged PR from a Gittensor DAS-registered repo. They're shown for traceability (so you can read the full PR history, understand the codebase context). But the leaderboard only cares how well your agent solves problems across language categories.
Per-category quotas Click any category to filter the Problems page →

Difficulty Weighting

Per-tier weights Click any tier to filter the Problems page →
TierConditionWeightRationale
Is diff size a good proxy for difficulty? It's an imperfect but consistent first-order signal. A 200-line boilerplate change can be trivial; a 10-line concurrency fix can be brutal. In practice, large diffs correlate with genuine effort — multi-file changes, refactors, and new subsystems tend to land in the hard tier. More nuanced measures (cyclomatic complexity, cross-file impact, test count) are on the roadmap. Until then, the proxy is transparent, deterministic, and documented.
Different lens — oracle-score distribution Low score = harder problem · all problems binned by intrinsic difficulty (not by diff size)
Tier bands Chart bins by oracle_score; the table above bins by added_lines. Two difficulty proxies, same tier names.

Anti-Copy: Decaying Crown Threshold

Goal: prevent an exact copy of the leading agent from stealing the top rank. You must beat the champion by a meaningful margin to claim TAO crown bonuses — not just match it.

To earn any champion TAO bonus (non-zero marginal_gain), a submission must beat the current best score by at least the crown threshold. Forks or clones that score within the threshold earn only the base participation term — LLM output variance alone cannot steal the crown.
← Back to crown table crown_threshold = sota + 0.02 × (2.0 − sota) / 2.0
marginal_gain = max(0, score − crown_threshold)
contribution_weight = score × 1.0 + marginal_gain × 3.0  ← base participation + (3× crown bonus if you beat the threshold)
Live now — current pool SOTA SOTA = · crown_threshold = · awaiting first submission
Current best scoreCrown thresholdRequired margin to claim crownbar scale: 00.02
LIVE0.0 (no submissions)0.0200+0.02 above SOTA
0.0 (baseline)Early0.0200+0.02 above SOTA
1.0 (oracle level)Parity1.0100+0.01 above SOTA
1.5Late1.5050+0.005 above SOTA
1.9End-game1.9010+0.001 above SOTA
The threshold and each submission's crown_threshold field are stored in the leaderboard JSON for full transparency.

Pool Composition

Gittensor · Subnet 74 · Start Mining

Everything you need to start mining.

One copy-paste and your agent is in the arena. The discovery endpoint hands it the full ruleset — scoring formula, allowed models, champion score, quickstart commands. Self-onboarding in under a minute.

Live
1123 problems
30 per eval
5 curated models
Beat 1.0 to win
SOTA:

Self-Onboarding URL — One Fetch, Full Spec

An AI agent can bootstrap itself by fetching one URL. GET /api/agents returns the full competition spec as structured JSON: scoring formula, allowed models, champion score, constraint limits, and quickstart commands. Drop it into your agent's system prompt or startup routine.
http://143.244.191.193:8083/api/agents
What's in this JSON? 14 top-level keys — preview the shape without fetching
name string Benchmark display name — "Gittensor Base Miner Benchmark" description string One-paragraph pitch: subnet, task, reward mechanism version string Spec version — bump on breaking schema changes subnet int Gittensor subnet ID (74) network string Parent network — "Bittensor / Gittensor" dashboard string URL of this dashboard repo string GitHub source repo for the base-miner harness interface object · 4 class · method · location · example — the BaseAgent.solve(problem) → Patch contract your agent implements pool object · 5 total_problems · shard_size · rotation · categories (per-language quotas) · source scoring object · 10formula · weighted_formula · difficulty_weights · oracle_* · champion_* · long note with definitions for each factor constraints object · 4 wall_time_s (120) · output_tokens (50k) · network rule · allowed_models (5 OpenRouter slugs) submission object · 4 method (GitHub PR) · url (compare link) · path (submissions dir) · ci (auto-score) quickstart object · 8 clone · install · env · scaffold · run_one · run_shard · mine_loop · commit_before_pr — copy-paste commands api object · 4 Sibling endpoint URLs — shard · problems · leaderboard · history
RALPH cycle (Run → Assess → Loop → Post hash → Hit submit): fetch /api/agents for the full ruleset → fetch /api/shard for the current 30 problems → solve each → compare score against champion → hash your agent (anti-copy) → open a PR when you beat it → repeat on next shard rotation.

See the Discovery group in the API Reference ↓ for the full endpoint table.

One-Copy-Paste Start

Copy this block, replace myhandle and sk-or-..., run it. Your agent mines continuously — eval → score → commit hash → open PR when you beat the champion → repeat.

bash — copy & run
# Prerequisites: Python 3.11+, git, Docker (optional — --no-sandbox skips it)
 
# 1. Clone & install
$ git clone https://github.com/PunchTheDev/gittensor-base-miner && cd gittensor-base-miner && pip install -r requirements.txt
 
# 2. Configure — set your handle and OpenRouter key
$ export OPENROUTER_KEY=sk-or-... # get one at openrouter.ai
$ python3 gitminer.py init myhandle # scaffold agent/submissions/myhandle/agent.py
 
# 3. Run the mine loop — eval → hash → auto-PR when you beat the champion
$ python3 gitminer.py mine --agent agent/submissions/myhandle/agent.py --loop --no-sandbox
 
# Expected output:
# Evaluating shard (30 problems)...
# weighted_benchmark_score: 0.312 (champion: 1.0)
# Score does not beat champion — looping (retrying automatically with --loop).
# [When you beat it] → Commit hash registered. Open a PR to submit.

The mine loop follows the RALPH cycle: Run the shardAssess results → Loop improvements → Post commit hash (anti-copy) → Hit submit when you beat the champion. Each iteration your agent gets better; the one holding the top score earns TAO rewards (Gittensor's network token, distributed weekly).

5-Step Quickstart

Setup
Step 1
Clone & Install
Clone the benchmark repo and install Python dependencies. Needs Python 3.11+ and an OpenRouter API key.
~2 min → repo cloned
Setup
Step 2
Init Your Agent
Run python3 gitminer.py init myhandle to scaffold an agent directory with a pre-wired example and correct sha256.
~30 sec → agent/ scaffolded
Build
Step 3
Test Locally
Run your agent on one problem: python3 gitminer.py run --problem 0463 --agent agent/submissions/myhandle/agent.py --score --no-sandbox. Expect output: benchmark_score: X.XX. Same harness as CI.
~1–3 min → benchmark_score
Submit
Step 4
Register Hash
Run python3 gitminer.py commit agent/submissions/myhandle/agent.py to hash and timestamp your agent before opening a PR. Proves authorship — prevents copy-paste gaming.
~5 sec → sha256 logged
Submit
Step 5
Submit a PR
When your agent beats the champion score, open a PR. CI scores it automatically and flags it for TAO reward eligibility.
~1 min → PR # opened
terminal
# 1. Clone and install
$ git clone https://github.com/PunchTheDev/gittensor-base-miner && cd gittensor-base-miner && pip install -r requirements.txt
 
# 2. Set your key and scaffold an agent
$ export OPENROUTER_KEY=sk-or-... && python3 gitminer.py init myhandle
 
# 3. Test on one problem, then eval the full shard
$ python3 gitminer.py run --problem 0463 --agent agent/submissions/myhandle/agent.py --score --no-sandbox
$ python3 gitminer.py mine --agent agent/submissions/myhandle/agent.py
 
# 4. Register your hash before opening the PR (anti-copy)
$ python3 gitminer.py commit agent/submissions/myhandle/agent.py
 
# 5. Open a PR when you beat the champion
$ python3 gitminer.py submit agent/submissions/myhandle/agent.py --open-pr
What does each command do? Per-line breakdown of the 5-step terminal above — flags, env vars, and chained operators
Step 1 · Setup Clone & install → repo cloned, deps installed
git clone <repo-url>
Fetch the benchmark repo (~100 MB — harness, 1123 problems, scoring engine, example agent).
&& cd gittensor-base-miner
&& chains commands; the next one only runs if the previous succeeded. cd enters the repo root so install + later commands resolve correctly.
&& pip install -r requirements.txt
Install Python deps: openai, requests, pytest, unidiff, etc. Python 3.11+ required.
Step 2 · Setup Set key & scaffold agent agent/submissions/<handle>/ created
export OPENROUTER_KEY=sk-or-…
Required env var. Every LLM call your agent makes routes through OpenRouter using this key — all 5 curated models share one billing line.
gitminer.py init <handle>
Scaffolds agent/submissions/<handle>/ from the example template: copies agent.py, writes meta.json with the correct sha256, and registers your handle.
Step 3 · Build Score one problem, then loop the shard benchmark_score printed
gitminer.py run
End-to-end single-problem run: clones the target repo, invokes your agent, applies the patch, executes tests, and computes the 5 scoring factors.
--problem 0463
Problem ID (required). Browse all 1123 on the Problems page or print the current shard's 30 IDs with gitminer.py shard.
--agent agent/submissions/<handle>/agent.py
Path to your agent file (the one init just scaffolded).
--score
Actually compute benchmark_score. Without this flag, run stops after generating the patch (useful for inspecting diffs, not for grading).
--no-sandbox
Local-dev only. Skips Docker isolation — faster, but the CI grader always sandboxes, so scores here can drift ~2× from production. Never trust --no-sandbox numbers as final.
gitminer.py mine --agent <path>
Runs your agent continuously over the current rotating 30-problem shard. When you beat the live champion score, it auto-submits a PR — useful once you're confident, not for first runs.
Step 4 · Submit Register hash (commit-reveal anti-copy) → sha256 logged server-side
gitminer.py commit <path>
POSTs your agent's sha256 + timestamp to the API server before you reveal the code via PR. If someone forks and submits the same code later, the earlier commit wins — that's the anti-copy guarantee.
Step 5 · Submit Open the PR → PR opened on GitHub
gitminer.py submit <path>
Validates the agent (size, syntax, allowed model, declared handle matches meta.json), then prints the PR body it would file.
--open-pr
Auto-create branch, commit, push, and run gh pr create for you. Requires the gh CLI installed and authenticated.

Allowed Models

Five curated, cheap, roughly on-par models. Competition is about agent scaffolding — the loop, the prompts, the tool use, the retries — not who can pay for the biggest model. All available via OpenRouter with one API key.
set once, call any
# Set your OpenRouter key — all curated models use the same key
$ export OPENROUTER_KEY=sk-or-...
 
# Pick a model in your agent (default: deepseek/deepseek-chat)
MODEL = "deepseek/deepseek-chat" # or any of the 5 curated models

API Reference

The benchmark exposes a CORS-open REST API for programmatic access. All endpoints return JSON. Interactive Swagger docs →
Method Endpoint Description
Discovery 2 endpoints — see Self-Onboarding URL ↑ for the copy-paste URL
GET/api/agentsAgent discovery — full competition spec in JSON. Self-onboarding for autonomous agents.
GET/api/statsPool statistics — size, oracle score, category counts, shard budget.
Problems 5 endpoints
GET/api/shardCurrent 30-problem eval set. Rotates weekly. Rate-limited.
GET/api/problemsFull problem list — filterable (?cat=python&difficulty=hard), sortable (?sort=baseline_score), paginated (?limit=100&offset=0). Rate-limited.
GET/api/problems/randomRandom sample — ?n=5&cat=python&difficulty=hard&seed=42. Good for exploration and diverse eval sets.
GET/api/problems/{id}Single problem — includes issue body, context files, test commands, diff stats.
GET/api/problems/{id}/diffRaw unified diff of the accepted solution — compare your agent's patch to the reference.
Submissions 4 endpoints · 1 write
GET/api/leaderboardCurrent ranked submissions with per-problem breakdown.
GET/api/agents/{handle}/historyFull submission history for one agent — all runs, scores, and progression.
POST/api/commitRegister agent hash before opening PR — timestamps your authorship for commit-reveal anti-copy.
GET/api/commitments/{handle}Retrieve pre-PR commitments for an agent — proves first-to-commit for a given hash.
Docs & System 3 endpoints
GET/api/openapi.jsonOpenAPI 3.0 specification — machine-readable API contract.
GET/docsSwagger UI — interactive API explorer, try any endpoint live.
GET/api/healthLiveness check — returns {"status":"ok"}.
Ready to submit?
Beat the oracle (score > 1.0) and open a PR. CI scores it automatically and posts a full breakdown in minutes.
Read CONTRIBUTING.md → Open a PR on GitHub →