Grid Simulator Mechanics

Welcome to the Watt-The-Hack Advanced Track docs. This is the complete reference for the grid components, financial constraints, scoring, event mechanics, and the API your controller talks to. Read the Quickstart first, then dip into whichever section you need.

One step

15 simulated minutes

A scenario run

288 steps (72 h / 3 days)

Your job

Meet demand at the lowest cost

The control loop, in one breath

Every 15 simulated minutes the engine hands your controller a state dictionary (current demand, solar, price, battery charge, a short forecast, and any active alerts). You return an action dictionary saying how to dispatch the battery, diesel, solar curtailment and grid reserves. The engine applies physics + market rules, charges you the cost of that step, and loops. Your score is the total cost over the whole run, lower is better.

Quickstart

From zero to a scored submission in four moves. Each links to the section with the detail.

pip install "watt-the-hack[playtest]"
The same engine the judges run. Develop on your own machine with full Python. Imports, LLM calls, anything.
A function controller(state) or a Strategy class with a step(self, state) method, returning an action dict. Start from a template in the Submission Portal.
python -m watt_the_hack.playtest my_controller.py \
  --scenario duck_curve --open-report
The HTML report tells you exactly where your money went.
Paste your code into the Submission Portal, pick a scenario, hit submit. No zipping. Attempts are limited. Playtest first.

Grid Components

The simulation models a single-node city grid. Your controller sits right at the center, balancing generation and demand.

City Demand

The baseline electricity consumption of the city (in MW). Your primary directive is to meet this demand at all times. Failing to do so triggers severe blackout penalties.

Solar Generation

Renewable energy generated by the sun (in MW). It's essentially free energy! If solar exceeds the city's demand, the excess is exported or used to charge your battery. You can choose to curtail (disconnect) the solar if you're hitting the grid's export caps.

Battery & Inverter

Stores excess energy. The battery is bound by two constraints: its Capacity (default 100 MWh) (how much it holds) and its Inverter Limit (default 50 MW) (the maximum rate it can charge/discharge per step).

Diesel Generator

An expensive emergency backup generator. You can manually dispatch it up to its MW limit to prevent blackouts when solar, battery, and grid imports are insufficient.

External Grid

The city's connection to the broader energy market. Import and export happen automatically to balance the grid. You never set them directly. Whatever your dispatch doesn't cover is imported (at state["price"]); any surplus is exported (at the export tariff). Both respect strict Import Caps (default 120 MW) and Export Caps (default 50 MW).

FCAS Market

Frequency Control Ancillary Services. A paid standby market. You bid fcas_reserve_mw = how much inverter capacity you keep on call for the grid, and get paid $40/MW per hour just for holding it ready, whether or not it's ever used. It's a promise, not a discharge. Full mechanics in the FCAS explainer below.

The Inverter Bottleneck

Think of the inverter as the gateway between your battery and the rest of the grid. Even if your battery is completely full, it can't discharge faster than its inverter limit.

Additionally, the inverter limit is shared between active battery_flow_mw and passive fcas_reserve_mw. If you have a 50 MW inverter and you reserve 20 MW for FCAS, you can only charge or discharge your battery at a maximum rate of 30 MW for that timestep. Managing this trade-off is crucial to advanced optimization.

FCAS, explained: it's a bid, not a discharge

What it is in real life: FCAS (Frequency Control Ancillary Services) is a real market run by AEMO, Australia's grid operator. The grid must stay at almost exactly 50 Hz; when a large generator trips, fast assets like batteries are paid to inject or soak up power within seconds to catch the frequency. Crucially, they're paid mostly for being on standby, not for the energy they end up moving. The Hornsdale "Tesla big battery" in South Australia earns a large share of its income from FCAS, not from buying and selling energy. This scenario models that market.

In the game it's a pure bid. Each step you set fcas_reserve_mw = how many MW you promise to keep available for the grid. You move no energy and pay nothing to bid. You're just claiming "I'm holding this much in reserve."

You're paid to hold it: $40 per MW per hour, whether or not it's ever called. Reserve 10 MW for an hour and you earn $400 for doing nothing but staying ready (per 15-minute step that's $10/MW).

It costs you inverter headroom. FCAS gets first claim on the inverter: |battery_flow_mw| + fcas_reserve_mw ≤ inverter limit (default 50 MW). Every MW you reserve is a MW you can't use for arbitrage that step. Changing your bid sharply between steps also costs a small ramp charge (~$500 per MW of change), so keep it steady.

Dispatch events test your claim. Sometimes the grid actually calls your reserve, a fcas_dispatch event asks for a number of MW over a window. You're warned ahead of time: read state["fcas_events_upcoming"] (each entry has at_step, end_step, magnitude_mw). When called you must actually deliver from stored charge, so a reserve bid is only as good as the battery SOC backing it.

⚠ Failing a dispatch is the harshest penalty in the game

  • Deliver the called MW (from your reserve + enough SOC) → a $200/MWh bonus.
  • Fall short. You didn't reserve enough, or your battery is too empty to back it → $100,000 per MWh of shortfall. Example: bid 10 MW, get called for 10 MW over one 15-minute step, deliver 0 → 10 MW × 0.25 h × $100,000 = $250,000 in a single step.

So FCAS is reliable, near-free income. But only bid what you can truly back with charge when a dispatch lands, and keep some SOC in the tank ahead of the windows the engine warns you about.

Costs & Penalties

Every step accrues cost. The sum over the run is your raw score (lower is better; a negative raw score means you turned a profit). Here's every lever that moves it.

Dynamic Import Tariff

What it costs to import power from the external grid (which happens automatically to cover any shortfall). It fluctuates wildly with time of day and market conditions. Read it via state["price"]. The goal is simple: buy low, avoid buying high.

Static Export Tariff

When surplus solar or battery power is exported to the grid (also automatic), you earn this flat rate (default $50/MWh). When price goes negative, exporting costs you. See Negative Prices under Common Pitfalls.

Demand Charge

Your single biggest grid-import spike over the whole run is billed once, at $1,000 per MW of that peak. One careless 120 MW import moment can cost $120,000 on its own. This rewards peak shaving. Pre-charging the battery so you never lean hard on the grid in a single step.

Battery Wear

Cycling the battery isn't free. Every MWh you move through it, charging or discharging, costs $50/MWh in wear, so needless round-trips quietly eat your score. Separately, a ramp charge penalises sudden swings in your net grid power between steps (≈ $1 × ΔMW², so a 30 MW swing ≈ $900). And since the battery is your main lever on the grid draw, jerky dispatch feeds straight into it. Smooth, gradual moves are much cheaper than slamming between extremes.

Carbon Cost

CO₂ from grid imports and diesel is priced at $50/kg. Grid intensity is ≈ 0.7 kg/MWh and diesel ≈ 0.27 kg/MWh by default (scenarios may override grid intensity, exposed as state["grid_co2_intensity"]). Cleaner dispatch is cheaper dispatch.

Blackout Penalty

If demand exceeds everything you can supply (solar + battery discharge + diesel + the grid import cap), the city browns out. This costs $100,000 per MWh of unmet demand. Avoid it at all costs.

Overvoltage Penalty

If (solar + battery discharge) overwhelms (demand + grid export cap), you flood the grid: $5,000 per MWh over the cap. Use curtail_solar or charge the battery to absorb the excess.

Throughput Budget

Some scenarios give you a total battery throughput budget (MWh). Every MWh you charge or discharge permanently consumes it. Read what's left via state["battery_throughput_remaining_mwh"]. Once depleted, the battery is locked. Spend cycles where they matter most.

Cost & penalty reference

Every line that can appear in your cost_breakdown, with the exact rate the engine applies. Positive numbers are costs; green is income. Most rows read $0 unless their mechanic is active in the scenario you're running - the bottom two groups only fire in the advanced scenarios.

Line itemWhat it isRate
Core - charged in every scenario
tariff_importBuying power from the external grid.price × MWh imported - dynamic, read state["price"] ($/MWh)
tariff_exportSelling surplus solar or battery power back to the grid.$50 / MWh exported (income)
demand_chargeYour single biggest import spike, billed once for the whole run (peak-shaving discipline).$1,000 / MW of peak import
carbon_costCarbon price on CO₂ from grid imports + diesel. Scenarios can override grid intensity.$50 / kg CO₂ · grid ≈ 0.7, diesel ≈ 0.27 kg/MWh
battery_wearWear from cycling the battery - charging or discharging.$50 / MWh moved
ramp_chargeSmoothness penalty on the step-to-step change in net grid power.(ΔMW)² × $1
generator_fuelRunning the diesel generator.$1,000 / MWh
blackout_penaltyUnmet demand (load shed). Avoid at all costs.$100,000 / MWh
overvoltage_penaltyExporting beyond the grid export cap.$5,000 / MWh
FCAS - when the scenario enables the reserve market
fcas_revenueAvailability pay for capacity you keep on standby (whether or not it's called).$40 / MW / hour reserved (income)
fcas_dispatch_bonusEnergy you actually deliver when a dispatch event calls your reserve.$200 / MWh delivered (income)
fcas_shortfall_penaltyFailing to deliver a called dispatch - bid only what your SOC can back.$100,000 / MWh short
fcas_ramp_chargeVolatility in your reserve bid between steps.$500 / MW of change
Advanced scenarios only - $0 unless that mechanic is live
compliance_penaltyBreaching an Operator's-Mandate window you opted into (SOC floor / export cap).$2,000,000 / SOC-unit · $500,000 / MW over cap, per step
diesel_ban_penaltyRunning diesel during a ban with no valid agent_plan exemption.$3,000 / MWh
anomaly_ack_fineEach step inside an anomaly window you didn't acknowledge in agent_plan.$5,000 / step
cyber_containment_fineMissing a real attack, or acknowledging a fake one.$50,000 each
ids_costSubscribing to the intrusion-detection signal.flat per-step fee set by the scenario
phishing_fineActing on a phishing / bait directive.fine set by the scenario

Scoring & Leaderboard

There are two numbers, and they point in opposite directions. Don't confuse them.

Raw cost: what you see locally

The playtest report's final_score is the total dollars your run accrued. Lower wins. This is what you optimise against on your own machine.

Points: what the leaderboard shows

Your raw cost is converted into leaderboard points. Higher wins. The conversion is anchored to two baselines for each scenario.

How raw cost becomes points

Each scenario has a naive baseline (a do-nothing-clever controller) and an optimal baseline (a strong reference). Your points are a linear interpolation between them:

points = 100 × (naive_cost − your_cost) / (naive_cost − optimal_cost)
points = clamp(points, 0, 150)
  • Match the naive baseline0 points.
  • Match the optimal baseline100 points.
  • Beat optimal → above 100, up to a 150-point cap per scenario.
  • Worse than naive → clamped to 0. The starter template scores 0. You must beat naive to put a point on the board.

What to actually aim for: hitting 100 means you matched our strongest reference controller - that's an excellent score and the realistic target. The 100–150 band only opens up when you genuinely beat that reference, which is hard and uncommon - don't treat 150 as the expected result. A strong submission lands near 100.

Your leaderboard total: why the Gauntlet dominates

Your headline total is the sum of your per-scenario points, with one twist: the Gauntlet counts triple (×3). With five Phase-1 scenarios plus the Gauntlet, a flawless run is 5 × 100 + (100 × 3) = 800 points, of which the Gauntlet is ≈ 37.5%. No other single scenario comes close. It is by far the highest-leverage thing you can get right. See The Gauntlet.

Worked example

Say a scenario's naive baseline is $1,000,000 and its optimal is $400,000 (a $600k "moat"). Your controller runs at $520,000:

points = 100 × (1,000,000 − 520,000) / (1,000,000 − 400,000)
       = 100 × 480,000 / 600,000  =  80 points

Shave another $120k off (to $400k) and you hit 100. Every dollar closer to optimal is worth more points when the moat is narrow. So the scenarios with the widest naive→optimal gap are where effort pays off most. The playtest report shows your standing on this ladder after every run.

Common Pitfalls & Clarifications

Understanding these concepts will save you from making costly mistakes in the simulation.

MW vs MWh (Power vs Energy)

MW (Megawatts) is the rate of power flow at a specific moment (e.g., your battery is discharging at 10 MW).

MWh (Megawatt-hours) is the volume of energy stored (e.g., your battery holds 50 MWh).

Since each timestep is 15 minutes (0.25 hours), flowing at 10 MW for one timestep uses 10 MW × 0.25h = 2.5 MWh of your battery capacity.

Inverter limits & clamping

A massive 100 MWh battery is useless if its inverter limit is only 10 MW. The inverter caps how much power (MW) can flow in or out per step. Ask for more. Say battery_flow_mw = 300. And the engine simply clamps it to the limit (~50 MW). It won't error, but you also won't get 300 MW. See "out-of-range values" under Controller Basics.

Negative Prices

When price is negative (e.g., -$20/MWh), the grid is oversupplied. Because import and export are automatic, this means you get paid to import (charge your battery) and you are charged to export. Curtail your solar so you aren't paying the grid to take your surplus!

Sign Conventions (Positive vs Negative)

The simulation uses a strict sign convention. For battery_flow_mw, positive means discharging into the grid (providing power), while negative means charging from the grid (consuming power).

Controller Basics

Your controller is just a Python function that runs every timestep. It takes in a state dictionary and returns an action dictionary.

The State Dictionary (what you read)

{
  "time": 42,                       # Timestep index (0..287)
  "demand": 145.2,                  # Current city demand (MW)   - this step only
  "solar": 80.5,                    # Current solar generation (MW) - this step only
  "soc": 0.45,                      # Battery state of charge (0..1) - this step only
  "price": 120.0,                   # Current import tariff ($/MWh)
  "features": {                     # Which mechanics are live this scenario
    "battery": True, "fcas": True, "ids": True
  },
  "forecast": {                     # Lookahead arrays (≈16 steps). NOISY - see Tips.
    "demand": [146.1, 150.2, ...],
    "solar":  [82.1, 85.0, ...],
    "price":  [125.0, 150.0, ...]
  },
  "alerts": [                       # Narrative events active RIGHT NOW (see Reacting to Events)
    {"id": "ids_w1", "type": "qualitative_alert", "severity": "critical",
     "title": "IDS Alert wave 1", "description": "SECURITY BREACH ...",
     "at_step": 30, "end_step": 42}
  ],
  "fcas_events_upcoming": [         # Scheduled FCAS dispatch calls (pre-position SOC!)
    {"at_step": 152, "end_step": 154, "magnitude_mw": 18.0}
  ],
  "ids_signal_node_a": 0.87,        # Attack-probability hint [0..1] - only if you subscribed,
  "ids_signal_node_b": 0.64,        #   otherwise None (see Cyber & Phishing)
  "battery_throughput_remaining_mwh": 500.0,   # If a throughput budget is active
  "peak_import_mw": 92.0,           # Your biggest grid import so far (drives the demand charge)
  "agent_plan": { ... }             # Whatever your plan()/replan() returned (see Agentic)
}

The Action Dictionary (what you return)

You control the grid for the upcoming timestep by returning a dictionary with any of these keys. If you leave a key out, it just defaults to 0.

def controller(state):
    # Your logic here...

    return {
        # Positive = discharge to grid, Negative = charge from grid.
        # Bounded by: inverter limit, SOC, and throughput budget.
        "battery_flow_mw": 10.5,

        # MW of emergency diesel generation to dispatch [0, max_limit]
        "emergency_generator": 0.0,

        # MW of solar to intentionally disconnect (prevents overvoltage)
        "curtail_solar": 0.0,

        # MW of inverter capacity to lock for frequency control.
        # This capacity cannot be used for charging/discharging this step!
        "fcas_reserve_mw": 5.0,

        # Pay for the IDS probability signal THIS step (cyber scenarios).
        # Populates state["ids_signal_node_a"/"_b"] on the NEXT step.
        "subscribe_ids": False,

        # Per-step acknowledgements the engine reads from YOUR ACTION.
        # (containment_ack / anomaly_ack - see Cyber & Phishing.)
        "agent_plan": {},
    }

What if I return out-of-range values?

You don't need to pre-validate your numbers. The engine reads each value, treats it as a number, and clamps it to what's physically possible that step. It does not error or penalise you just for asking for too much:

  • battery_flow_mw → clamped to ±the inverter limit (default 50 MW, minus any FCAS reserve), then further to what your state of charge can actually deliver or absorb. So battery_flow_mw = 300 just becomes a ~50 MW full-power discharge. Not an error.
  • curtail_solar → clamped to [0, current solar].
  • emergency_generator → clamped to [0, its MW limit].
  • fcas_reserve_mw → clamped to [0, inverter limit] (and it shares the inverter with battery flow).
  • Any key you omit defaults to 0; a negative where only positive makes sense is clamped up to 0.

So always return plain numbers. A value the engine can't convert to a number (None or non-numeric text) can fail the whole evaluation. And if your step() throws, or returns something that isn't a dict, that step is replaced with the zero action (do nothing) and logged as a controller error. One buggy step won't crash the run, but its intended action is lost.

Tips for Success

  • Don't trust the forecast implicitly: Forecasts carry AR(1) noise and a persistent bias, and can be deliberately skewed by cyberattacks in later scenarios. The error is structured, so it's partly learnable. Debias it rather than taking it at face value.
  • Respect the Inverter: The battery cannot charge/discharge faster than its inverter limit, and any capacity assigned to fcas_reserve_mw reduces your available bandwidth for normal flow.
  • Watch the Caps: 100 MW of excess solar doesn't mean you can export 100 MW. If the export cap is 50 MW, the rest causes an overvoltage penalty unless you curtail it.
  • Shave your peak: The demand charge bills your single worst import spike. Smooth, pre-emptive battery use beats reacting at the peak.

State & Python essentials

New to Python, or unsure what the engine does between steps? This is the stuff that quietly breaks first-time controllers. Read it before you fight a bug that isn't really a bug.

How the engine runs your code

The engine imports your file once, then calls your controller every 15 simulated minutes for the whole run. It never restarts in between. So a value you compute inside step() / controller() and keep in a local variable is thrown away the instant the function returns. Locals do not survive to the next step. To remember anything, you need state that lives outside a single call.

Keep an array (or any value) alive for the whole run

This is the most common sticking point: you want a list you can update every step (a price history, a rolling error buffer, a counter). There are two correct ways.

Recommended: a Strategy class with self.

class MyStrategy:                    # any name works - the portal detects it
    def __init__(self):
        # Runs ONCE, when the engine first creates your strategy.
        # Put anything you want to remember for the whole run here.
        self.price_history = []          # this list lives for the entire run

    def step(self, state):
        # self.price_history is the SAME list every step - append to it,
        # read it, modify it, and it is still there on the next step.
        self.price_history.append(state["price"])
        recent = self.price_history[-12:]            # last 3 hours (12 x 15min)
        avg = sum(recent) / len(recent)
        flow = 20.0 if state["price"] > avg else -20.0
        return {"battery_flow_mw": flow}

The engine builds your strategy class once and reuses that one instance for every step, so everything stored on self persists automatically. This is the clean way, and it sidesteps the gotcha below.

Alternative: a module-level variable (plain function)

price_history = []          # defined at the TOP of the file = "module level"

def controller(state):
    price_history.append(state["price"])         # appending is fine, no keyword needed
    recent = price_history[-12:]
    avg = sum(recent) / len(recent)
    return {"battery_flow_mw": 20.0 if state["price"] > avg else -20.0}

This works because your file is imported once, so the module-level price_history is shared across every call.

The #1 gotcha: rebinding a global needs the `global` keyword

Mutating a module-level value (.append(), my_list[i] = ..., my_dict[k] = ...) works with no ceremony. But if you reassign the name itself inside a function, Python silently creates a new local instead. So your value never actually updates:
counter = 0

def controller(state):
    global counter          # WITHOUT this line, the next line makes a NEW local...
    counter = counter + 1   # ...and the module-level counter never changes.
    return {"battery_flow_mw": 0.0}
Rule of thumb: mutating an existing object needs nothing; rebinding the name (counter = ..., x = x + 1) needs global. The class-based approach avoids this trap entirely. Just write self.counter += 1.

What does NOT persist

  • Local variables inside step() / controller(). Reset on every step.
  • Arbitrary keys you write into the state dict you're handed (e.g. state["my_thing"] = ...). The engine rebuilds that view fresh each step, so they're discarded. Use self.* or a module-level variable instead.

The one channel the engine deliberately carries forward is state["agent_plan"], which accumulates whatever your plan() / replan() returned and whatever you return under the agent_plan key from step() (see Reacting to Events).

Class syntax, for non-experts

  • A class bundles data (self.x) with functions (called methods). The engine makes one instance and calls its step each tick.
  • Every method's first parameter must be self (def step(self, state):). Forget it and you get TypeError: step() takes 1 positional argument but 2 were given.
  • To call one method from another, go through self.. A bare helper() raises NameError.
  • Plain functions defined at module level are callable from inside a method directly (no self.).
  • You never create the class yourself. The engine does. So __init__(self) must work with no arguments (don't add required parameters).
def clamp(x, lo, hi):                 # a plain module-level helper
    return max(lo, min(hi, x))

class MyStrategy:                     # any name works
    def __init__(self):
        self.target_soc = 0.5

    def _decide_target(self, price):  # a helper METHOD (leading _ is just convention)
        return 0.8 if price < 0.10 else 0.3

    def step(self, state):
        self.target_soc = self._decide_target(state["price"])   # call a method via self.
        raw = (self.target_soc - state["soc"]) * 100
        return {"battery_flow_mw": clamp(raw, -50.0, 50.0)}       # call a function directly

A few more traps that fail silently or error

  • Mutable default arguments: def f(x, hist=[]): creates that list once and reuses it across calls. Use hist=None, then hist = hist or [] inside.
  • Division: 1 / 2 is 0.5 (true division); 1 // 2 is 0 (floor). Mixing them up skews your maths.
  • Indentation is syntax: blocks are defined by consistent indentation (4 spaces). Mixing tabs and spaces, or uneven indents, is an IndentationError.

Playtest & Debug

The public Python engine is identical to the one the judges run. Iterating locally. With the HTML report open. Is the single fastest way to climb the leaderboard. This is also where you find out why you lost points.

1. Install the engine

The public engine is published on PyPI. Install it with the playtest extras:
pip install "watt-the-hack[playtest]"
As new scenarios unlock, upgrade with pip install --upgrade "watt-the-hack[playtest]" to pull the latest content.

2. Run the playtest harness

Run your controller against a scenario. --open-report pops the HTML report in your browser when it finishes.
python -m watt_the_hack.playtest my_controller.py --scenario duck_curve --open-report

3. List available scenarios

See everything unlocked in your installed package:
python -m watt_the_hack.playtest --list-scenarios

Read the report. It tells you exactly where the money went

"I scored badly" is not a diagnosis. Every run writes a folder under runs/<scenario>_<timestamp>/ containing a report.html plus raw metrics.json and per-step steps.csv. The report breaks down precisely which line items cost you:

  • Cost breakdown. Every component (import, battery wear, demand charge, each penalty…) ranked by size and as a % of your total. Your "biggest lever" is called out.
  • Top worst steps. The exact timesteps that hurt most, with the demand, solar, SOC, net grid and penalty at that moment. Start your fixes here.
  • Opportunities. Data-driven hints ("you're importing at peak price in the evening. Pre-charge earlier").
  • Baseline ladder. Where your raw cost sits between the naive and optimal baselines, i.e. roughly how many points you'd score.

If a penalty line you didn't expect is non-zero (e.g. cyber_containment_fine, diesel_ban_penalty, fcas_shortfall_penalty), that's a mechanic you mis-handled. Jump to the matching section below.

Inspect costs straight from Python

Prefer raw numbers? Run a controller and print the breakdown yourself:

from watt_the_hack.playtest import run_controller   # convenience wrapper

result = run_controller("my_controller.py", scenario="gauntlet")
print(result["metrics"]["final_score"])             # raw cost (lower wins)
for k, v in sorted(result["cost_breakdown"].items(),
                   key=lambda kv: -abs(kv[1])):
    print(f"{k:<28} {v:>14,.0f}")                    # where every dollar went

Reacting to Events

From Frequency Frenzy onward, scenarios fire events: weather notes, demand spikes, operator briefs, compliance windows, cyberattacks. Here's how they reach your controller and how to respond without blowing your time budget.

Three channels carry events

  1. state["alerts"]. The list of narrative events active right now. Each has id, type, severity, title, description (the prose), at_step, end_step. Read it every step.
  2. replan(self, state, alerts). An optional hook that fires whenever alerts are active. The right place for a slow parse (e.g. an LLM call). Its return value is merged into the persistent agent_plan.
  3. state["agent_plan"]. Your standing memo to the engine. It persists across steps and is how you respond to enforcement events (acknowledge an attack, file an exemption).

⚠ Not every event shows up as an alert

Narrative events (qualitative alerts, weather, demand/price signals, forecast-bias notices) appear in state["alerts"]. But the engine deliberately hides the structured enforcement windows. Compliance windows, the diesel ban, cyberattack windows, phishing traps. And strips their numeric payload. That's the whole challenge of the advanced scenarios: the prose brief tells you a rule is coming ("cap exports to 22 MW over steps 108–124"), and you must parse it into the right action. The engine won't hand you the number.

⚠ The biggest timeout trap: replan fires every active step

replan is called on every step where at least one alert is active, not once per alert. An alert that spans steps 8–20 calls replan 13 times. Across a full run that can be 100–200 calls. If you fire an LLM request on every one, you'll blow the ~14-minute evaluation budget and your run ends in TIMEOUT with no score.

The fix is one line: dedupe by alert id. Track which ids you've already handled and only do expensive work for genuinely new ones. The number of distinct alerts is small (see the table), so a deduped controller makes only a handful of LLM calls.

class Strategy:
    def __init__(self):
        self.seen = set()
        self.constraints = []          # parsed rules live here for step() to read

    def replan(self, state, alerts):
        new = [a for a in alerts if a["id"] not in self.seen]
        if not new:
            return {}                  # nothing new -> do NO work, return immediately
        for a in new:
            self.seen.add(a["id"])
            self.constraints.append(self._parse(a))   # regex or ONE batched LLM call
        return {}                      # (you can also return dict updates for agent_plan)

    def _parse(self, alert):
        # turn alert["description"] prose into a structured rule on self
        ...

How many alerts to expect

Counts for the scored (judging) run of the scenarios where an LLM actually helps. The earlier ones (Duck Curve, Frequency Frenzy, AI Grid Shock) don't need one. Budget your LLM calls against distinct alerts (what you handle when you dedupe), not the raw replan firings.

ScenarioDistinct alerts (your LLM-call budget)replan() firings if you don't dedupe
The Operator's Mandate13~73
Cybersecurity11~92
The Gauntlet18~119

The middle column is the number of unique alerts. Dedupe by id and that's how many times you do real work. The right column is how often replan() is called if you don't dedupe: firing an LLM on each of those will time you out. Figures are for the scored (judging) runs and may shift slightly as scenarios are tuned; treat them as ceilings, not contracts.

agent_plan is one persistent dictionary

Everything you write to agent_plan accumulates into a single plan that persists for the rest of the run, and the engine reads it back. Anything you return under the agent_plan key of your step() action is merged into that same plan. So you don't have to memorise which key goes where. The clean habit:

Per-step acknowledgements → from step()

containment_ack / anomaly_ack are a live signal. Set them in the agent_plan you return from step(), every step the incident is live.

One-time policy → from plan() / replan()

emergency_exemption and your parsed constraints are a standing document you file once. The natural home for anything you parse with an LLM (see the budget rule).

Both land in the same persistent agent_plan; pick the hook that reads cleanly. The worked examples in Cyber & Phishing show both.

Cyber & Phishing Defense

These scenarios attack your inputs. The Cybersecurity scenario runs anomaly windows you must detect and acknowledge; the Gauntlet escalates with a subscribable IDS signal plus real and decoy attack waves. Some incidents are real and must be contained; some are decoys you must not react to; some prose is bait designed to make you sabotage yourself.

The IDS signal

Set subscribe_ids: True in your action to buy an intrusion hint (small per-step fee). Next step you can read two probabilities, ids_signal_node_a and ids_signal_node_b ∈ [0, 1]. If you don't subscribe, both are None.

The forecast cross-check

A real attack corrupts your live sensors (demand, solar, soc) but the forecast still reads the true series. A large, sudden gap between a sensor and its forecast is a tell that your data is being spoofed.

The alert prose

Critical alerts name the incident and often the exact id to acknowledge. Decoy alerts hedge ("single node", "second node disagrees", "do not ack unless both agree"). Read the words.

Cybersecurity scenario: anomaly windows + trust calibration

In raw numbers a real step-change and a spoof look identical - the operator's prose is what tells you which channel to trust during each anomaly window:

  • Real event → the live meter is honest; act on it.
  • Spoofed meter ("a spike the forecast does not corroborate") → trust the forecast.
  • Poisoned forecast ("the forecast feed may be compromised; the meter is honest") → trust the sensor.

Acknowledge the window with agent_plan["anomaly_ack"] = <anomaly_id> while it is live, and dispatch on the value you chose to trust:

import re

class Strategy:
    def __init__(self):
        self.windows = []        # anomaly windows parsed from the alert prose
        self.seen = set()

    def replan(self, state, alerts):
        for a in alerts:
            if a["id"] in self.seen:
                continue
            self.seen.add(a["id"])
            text = (a.get("title", "") + " " + a.get("description", "")).lower()
            if "anomaly" not in text:
                continue
            m = re.search(r"anom[-_][a-z0-9]+", text)   # the id to acknowledge
            if "forecast" in text and "compromis" in text:
                trust = "sensor"        # forecast poisoned -> believe the meter
            elif "does not corroborate" in text or "spoof" in text:
                trust = "forecast"      # meter spoofed -> believe the forecast
            else:
                trust = "sensor"        # neutral / real event -> the live meter
            self.windows.append({"id": m.group(0) if m else a["id"], "trust": trust,
                                 "start": a.get("at_step"), "end": a.get("end_step")})
        return {}

    def step(self, state):
        t = int(state["time"])
        sensor = state["demand"]
        fc = (state.get("forecast", {}) or {}).get("demand") or [sensor]
        agent_plan, demand = {}, sensor
        for w in self.windows:
            if w["start"] is not None and w["start"] <= t <= w["end"]:
                agent_plan["anomaly_ack"] = w["id"]          # ack while the window is live
                demand = fc[0] if w["trust"] == "forecast" else sensor
        net = demand - state["solar"]
        flow = max(-50.0, min(50.0, net)) if state["soc"] > 0.1 else 0.0
        return {"battery_flow_mw": flow, "agent_plan": agent_plan}

Everything below is the Gauntlet's escalation - a subscribable IDS signal and real/decoy attack waves with containment_ack. The standalone Cybersecurity scenario uses the anomaly mechanic above, not IDS.

Real vs decoy: the two IDS nodes must agree

The two IDS nodes are designed so you can't trust either alone:

  • Real attack → both nodes read high (node A ≈ 0.85, node B ≈ 0.65). They agree.
  • Decoy / false flag → node A looks suspicious (≈ 0.60) but node B stays low (≈ 0.10). They disagree.
  • Normal → both low (≈ 0.20 / 0.10).

So the rule is: only treat it as real when BOTH nodes clear a threshold (≈ 0.4 each). The signal is noisy, so smooth it over a couple of steps rather than trusting a single reading.

Containment: acknowledge real attacks, ignore decoys

During a real attack window you must set agent_plan["containment_ack"] = <the attack id> (the critical alert names it, e.g. "attack_30"). Both mistakes are punished equally hard. About $50,000 per step:

  • Miss a real attack (no ack, or wrong id) → fined every step of the window.
  • Acknowledge a decoy (ack the fake one's id) → fined every step of that window.

Remember the channel: containment_ack is read from the action your step() returns, so set it there, every step the confirmed attack is live.

Worked example. Detect, confirm, contain

import re

class Strategy:
    def __init__(self):
        self.seen = set()
        self.attacks = {}        # attack_id -> (start_step, end_step), learned from prose
        self.ema_a = 0.0         # smoothed IDS node A
        self.ema_b = 0.0         # smoothed IDS node B

    def replan(self, state, alerts):
        # Parse each NEW alert once. A critical alert names the attack id to ack;
        # a decoy alert names an id but tells you NOT to ack it.
        for a in alerts:
            if a["id"] in self.seen:
                continue
            self.seen.add(a["id"])
            m = re.search(r"`containment_ack`:\s*`([^`]+)`", a.get("description", ""))
            if m and a.get("severity") == "critical" and "decoy" not in a["description"].lower():
                self.attacks[m.group(1)] = (a.get("at_step"), a.get("end_step"))
        return {}

    def step(self, state):
        t = int(state["time"])

        # 1) Smooth the (noisy) IDS signal. Subscribe whenever an attack window is open.
        in_window = any(s <= t <= e for (s, e) in self.attacks.values())
        a = state.get("ids_signal_node_a") or 0.0
        b = state.get("ids_signal_node_b") or 0.0
        self.ema_a = 0.5 * self.ema_a + 0.5 * a
        self.ema_b = 0.5 * self.ema_b + 0.5 * b
        corroborated = self.ema_a >= 0.4 and self.ema_b >= 0.4   # BOTH nodes agree -> real

        # 2) Acknowledge ONLY a confirmed real attack, using the id from the prose.
        agent_plan = {}
        for aid, (s, e) in self.attacks.items():
            if s is not None and s <= t <= e and corroborated:
                agent_plan["containment_ack"] = aid

        # 3) When data looks spoofed, steer by the forecast instead of the live sensor.
        demand = state["demand"]; solar = state["solar"]
        fc = state.get("forecast", {})
        if corroborated and fc.get("demand") and fc.get("solar"):
            demand, solar = fc["demand"][0], fc["solar"][0]

        net = demand - solar
        flow = max(-50.0, min(50.0, net)) if state["soc"] > 0.1 else 0.0
        return {
            "battery_flow_mw": flow,
            "subscribe_ids": in_window,     # only pay for IDS while a window is open
            "agent_plan": agent_plan,       # containment_ack read from HERE
        }

Operator's Mandate: parse prose into compliance windows

Each brief is a constraint in plain English - a reserve floor, an export cap - with a step window. Parse them once in replan(), enforce the active ones in step(). The traps this scenario is built around: fractional words ("four-fifths of capacity"), rescissions ("the prior directive is RESCINDED"), and life-safety overrides that win an overlap.

import re

MIN_SOC = re.compile(r">=\s*(\d+)\s*%.*?steps?\s+(\d+)\s+through\s+(\d+)", re.I | re.S)
EXPORT  = re.compile(r"export.{0,40}?(?:cap|limit|reduced).{0,80}?(\d+)\s*MW.*?steps?\s+(\d+)\s+through\s+(\d+)", re.I | re.S)

class Strategy:
    def __init__(self):
        self.rules, self.seen = [], set()

    def replan(self, state, alerts):
        for a in alerts:
            if a["id"] in self.seen:
                continue
            self.seen.add(a["id"])
            text = a.get("title", "") + "\n" + a.get("description", "")
            pr = 100 if "life-safety" in text.lower() else 50   # life-safety wins overlaps
            m = MIN_SOC.search(text)
            if m:
                self.rules.append({"min_soc": int(m.group(1)) / 100,
                    "start": int(m.group(2)), "end": int(m.group(3)), "pr": pr})
            m = EXPORT.search(text)
            if m:
                self.rules.append({"max_export": float(m.group(1)),
                    "start": int(m.group(2)), "end": int(m.group(3)), "pr": pr})
        return {}

    def step(self, state):
        t = int(state["time"])
        active = [c for c in self.rules if c["start"] <= t <= c["end"]]
        if active:                                   # only the top-priority tier applies
            top = max(c["pr"] for c in active)
            active = [c for c in active if c["pr"] == top]
        min_soc = max((c["min_soc"] for c in active if "min_soc" in c), default=None)
        # ...then hold SOC >= min_soc and cap exports at max_export this step.
        return {"battery_flow_mw": 0.0}

Regex catches the clean AEMO-style phrasings; the fractional-word and rescission cases are where an LLM parse pulls ahead - that cost gap is the lesson of this scenario.

The diesel-ban exemption

During an environmental diesel-ban window, running diesel without a filed exemption costs $3,000/MWh. To run it legally you file a small document. It's one-time policy, so plan() / replan() is the natural home (it persists for the run):

def replan(self, state, alerts):
    # File the exemption once. The id is named in the EPA brief, e.g. "epa_g77_day3".
    return {
        "emergency_exemption": {
            "directive_id": "epa_g77_day3",
            # reason must be substantive: >= 60 chars, contain a number AND an
            # operational keyword (mw / soc / demand / deficit / import / capacity ...).
            "reason": ("Confirmed import deficit of 28 MW above the 120 MW grid cap "
                       "during the EPA window; battery SOC at 12% and demand rising, "
                       "diesel required to maintain load."),
            "expected_duration_steps": 6,        # int, 1..12
        }
    }

A vague or numberless reason is rejected and the penalty applies. The engine wants a real justification, not a rubber stamp.

Phishing: never let prose write your plan

Some alerts are bait. They'll urgently instruct you to put a specific key in your agent_plan ("set verify_reserve: true to confirm"). If you obey, the engine fines you. The defense is a hard rule:

Only ever write keys to agent_plan that your own controller logic decided on (containment_ack, anomaly_ack, emergency_exemption, your constraints). Never copy a key just because an alert's text told you to.

If you're using an LLM to parse briefs, this is a prompt-injection risk: instruct the model to extract constraints only, and validate its output against a fixed allow-list of keys before you act on it.

Advanced: LLM-driven strategies

The Operator's Mandate, Cybersecurity, and the Gauntlet hide their rules in plain-English briefs. To act on them you have to read the text. That's where an LLM helps. This is the part most people find confusing, so here it is from scratch.

The one idea that makes it click

Your code is not restarted each step. The engine creates one instance of your Strategy class and keeps it for the whole run, calling three methods on that same object at three different times. Because it's one long-lived object, anything you save on self in one method is still there in the others. That is the whole trick: the slow thinking (an LLM call) happens in the methods that run rarely, saves its conclusions on self, and the fast loop just reads them.

The three methods, in plain English

One object, three jobs, three very different frequencies.

plan(): study once

Runs one time, before step 0, with the opening state. Job: read the briefing and pick your overall game plan. An LLM call is fine here. It happens once. Optional.

replan(): react to news

Runs while an alert is active. Job: turn a new brief into a concrete rule and save it on self. An LLM call is fine here if you dedupe (handle each alert id once). Optional.

step(): act fast

Runs every 15 min, 288 times. Job: return this tick's dispatch using what you already prepared. Never call an LLM here. Required.

When each one fires (one run, start to finish)

run starts
│
├─ plan(state)               ← ONCE   (optional LLM: read the briefing)
│
├─ step(state)   t = 0       ← every tick: fast, NO LLM
├─ step(state)   t = 1
│      ⋮
│   🔔 a new alert appears around t = 30
├─ replan(state, alerts)     ← fires because an alert is active
│                               (optional LLM: parse it, save to self)
├─ step(state)   t = 30      ← reads what replan just saved on self
├─ step(state)   t = 31
│      ⋮
└─ step(state)   t = 287     ← run ends

Two things beginners trip on: replan fires on every step an alert is active (not once per alert. Hence dedupe, see Reacting to Events), and step runs every tick whether or not an alert is present.

How the methods hand information to each other

Two channels. And as a beginner you mostly need the first:

  • self.something. Your own notebook. Whatever you assign to self in plan / replan is readable in step. This is how an LLM's decision reaches the fast loop. Use it for almost everything.
  • agent_plan. A note to the engine. Only for the handful of keys the engine itself reads (containment_ack, emergency_exemption, anomaly_ack). Return it from any method; it persists and also shows up as state["agent_plan"] (see Reacting to Events & Cyber).

The LLM budget rule: call it in plan / replan, never step

Your entire evaluation (every timestep of the run) must finish within ~14 minutes of wall-clock. There is no per-step rescue: if the run as a whole exceeds the budget it ends in TIMEOUT with no score. (A timeout is a free retry: it doesn't burn one of your attempts, but you still get nothing back.) So the LLM has to live where it's called rarely:

  • plan(initial_state) runs once before step 0: the right place for an LLM call to read the scenario briefing and pick a high-level policy.
  • replan(state, alerts) runs whenever alerts are active. So dedupe by alert id (see Reacting to Events) and only call the LLM for new ones. The dict you return is merged into the persistent state["agent_plan"].
  • Never call an LLM from step(state). It runs every 15 simulated minutes; a network call there is multiplied across the whole run and will blow the budget. Instead read state.get("agent_plan", {}) and branch on the cached policy: LLM-quality decisions at deterministic-controller latency.
  • Keep models fast: gpt-5.4-nano or gpt-5.4-mini (see the OpenAI section). Even in plan/replan, a slow model called repeatedly can run you over.

A full worked example. The three methods cooperating

The LLM "thinks" in plan and replan and writes its conclusions onto self; step just reads them and dispatches. Trace self.stance and self.export_cap through the three methods:

class MyStrategy:                       # any class name works
    # The engine builds this ONCE and reuses it for the whole run.

    def __init__(self):
        # self.* is shared memory across plan / replan / step.
        self.stance = "balanced"        # plan() will set this
        self.export_cap = None          # replan() will set this from a brief
        self.handled = set()            # alert ids already parsed (dedupe!)

    # ── ONCE, before t=0. A slow LLM call is fine here. ──────────────
    def plan(self, state):
        briefs = state.get("alerts", [])
        # ask_llm_* are YOUR helpers that call OpenAI (see the OpenAI section).
        self.stance = ask_llm_for_stance(briefs) or "balanced"   # save on self
        return {}                        # nothing for the engine yet

    # ── while an alert is active. DEDUPE, then (maybe) call the LLM. ──
    def replan(self, state, alerts):
        for a in alerts:
            if a["id"] in self.handled:  # already handled ->
                continue                 #   do NO work (this is what saves you)
            self.handled.add(a["id"])
            cap = parse_export_cap_with_llm(a.get("description", ""))  # 22.0 or None
            if cap is not None:
                self.export_cap = cap    # save on self for step() to use
        return {}                        # everything we need is on self

    # ── EVERY 15 min (288x). Fast. NO LLM. Use what we prepared. ─────
    def step(self, state):
        demand, solar, soc = state["demand"], state["solar"], state["soc"]
        net = demand - solar
        flow = max(-50.0, min(50.0, net)) if soc > 0.1 else 0.0

        if self.stance == "conserve":    # <- decided by the LLM in plan()
            flow = min(flow, 0.0)        #    hold charge; don't discharge hard

        curtail = 0.0
        if self.export_cap is not None:  # <- parsed by the LLM in replan()
            export = max(0.0, -(net - flow))
            curtail = max(0.0, export - self.export_cap)

        return {"battery_flow_mw": flow, "curtail_solar": curtail}

ask_llm_for_stance and parse_export_cap_with_llm are functions you write that call the OpenAI API (see the OpenAI section). Make them fail soft: if the key is missing or the call errors, return a sensible default so a network blip never crashes the run.

No LLM? You can still play.

Nothing forces you to use an LLM. Swap the *_with_llm helpers for plain string matching / regex on a["description"], or skip plan and replan entirely and write a pure step controller. The LLM just makes the wordier briefs easier to parse. It's a tool, not a requirement.

The Gauntlet

The finale. A single 288-step (3-day) run that combines every mechanic from the earlier scenarios at once. You get one submission, and it counts ×3 on the leaderboard. By far the highest-leverage scenario in the event.

It introduces no new mechanic. So you can fully prepare

Every challenge in the Gauntlet is something you already practised in an earlier scenario, where you have unlimited local playtests. The single-submission limit applies to the scored run, not to your iteration: master each mechanic in its own scenario first, then the Gauntlet is "just" doing all of them in one controller.

What it folds in, and where you learned it

In the Gauntlet you'll face…Practise it in
A severe duck curve + battery throughput budgetDuck Curve
A noisy AR(1) forecast to debias and plan againstFrequency Frenzy
FCAS reserve bids + scheduled dispatch callsAI Grid Shock
Prose briefs → compliance windows (SOC floors, export caps)Operator's Mandate
An EPA diesel-ban window needing an exemptionOperator's Mandate
A phishing / bait directive to ignoreOperator's Mandate
Anomaly windows to detect + acknowledgeCybersecurity
Subscribable IDS signal + real/decoy attack wavesnew in the Gauntlet

Build a detector, not a memoriser

The Gauntlet rephrases its briefs and can shuffle the timing of incidents between runs, and the scored run is graded fresh. A controller that hard-codes "ack at step 30" is brittle; one that detects conditions (corroborated IDS nodes, a forecast-vs-sensor gap, a parsed constraint window) generalises. The reference patterns in Cyber & Phishing and Reacting to Events are written to generalise on purpose.

One-submission checklist

  • Score 100+ on each Phase-1 scenario locally before you touch the Gauntlet.
  • Confirm your replan dedupes. Check your LLM call count over a full run.
  • Verify each penalty line is $0 in the report: containment, diesel-ban, FCAS shortfall, compliance, overvoltage, blackout.
  • Confirm you ignore the bait key and only write your own keys to agent_plan.
  • Check your worst-case wall-clock is comfortably under 14 minutes (LLM latency varies).
  • Re-read the Submission Guide: the Gauntlet allows 1 attempt.

Submission Guide

Submit your controller through the in-app Submission Portal. No zipping, no CLI: paste your code into the editor and hit submit.

The structure of your submission

Your code can take exactly one of two shapes. The engine auto-detects which. Pick the one that fits your strategy; there's no advantage to picking the more complex one if you don't need it.

Shape 1

A controller(state) function

Use this if your controller is stateless: every step's action depends only on the current state (plus the forecast). No persistent variables between timesteps.

def controller(state):
    # Read state, return an action dictionary.
    soc = state["soc"]
    demand = state["demand"]
    solar = state["solar"]

    return {
        "battery_flow_mw": demand - solar,
        # Any key you omit defaults to 0.
    }

REQUIRED

  • The function MUST be named controller and take a single state argument.
  • It MUST return a dict (any of the action keys; missing keys default to 0).
Shape 2

A Strategy class

Use this if you need persistent state between timesteps (e.g. a rolling error buffer, a PID memory, a precomputed plan from an LLM call). The engine instantiates your class once and reuses it for the whole run.

class Strategy:
    def __init__(self):
        # Anything you want to persist between steps lives on self.
        self.history = []

    def plan(self, initial_state):
        # OPTIONAL. Called ONCE before step 0. Right place for a slow
        # LLM call to read the scenario briefing. The dict you return
        # is stashed on state["agent_plan"] for every later step().
        return {"policy": "conserve"}

    def replan(self, state, alerts):
        # OPTIONAL. Called when qualitative alerts are active mid-run.
        # Right place for a second LLM call to react to text events.
        # Returned dict is merged into state["agent_plan"]. DEDUPE first!
        return {"policy": "respond"}

    def step(self, state):
        # REQUIRED. Called every 15-minute timestep.
        # DO NOT call an LLM here; it will time out.
        # Read state["agent_plan"] if you used plan()/replan().
        self.history.append(state["soc"])
        return {"battery_flow_mw": 10.0}

REQUIRED

  • The class can have any name (the starter code uses MyStrategy). The portal detects it automatically from your code. What matters is the step method below, not the class name.
  • It MUST define a step(self, state) method, written directly in this class, that returns an action dict. If the class has no step method, the engine refuses the submission. (A step only inherited from a base class or assigned by alias isn't detected. Keep it as a plain method here.)
  • It MUST be instantiable with no args (no required __init__ parameters).

plan(self, initial_state) and replan(self, state, alerts) are optional: the engine just skips them if you don't define them.

Using the OpenAI API

The evaluation platform injects OPENAI_API_KEY as an environment variable inside your container. Read it with os.environ; it's already there.

# Works as-is on the evaluation platform.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Use a fast model; the whole evaluation must finish within ~14 minutes.
resp = client.chat.completions.create(model="gpt-5.4-nano", messages=[...])

⚠ Use gpt-5.4-nano or gpt-5.4-mini

Your whole evaluation runs under a ~14-minute budget. Stick to gpt-5.4-nano (fastest, cheapest) or gpt-5.4-mini; they're quick enough to stay inside it. Larger or slower models risk exceeding the budget, and your run times out with no score (a timeout doesn't cost you a submission attempt, but you also get no result). These models draw on a shared credit pool, so keep calls few (call in plan/replan, dedupe by alert id).

⚠ Remove your .env loading line before you submit

Locally you probably load your key from a .env file:

# LOCAL TESTING ONLY. DELETE BEFORE SUBMITTING.
from dotenv import load_dotenv
load_dotenv()

Delete those two lines (and the python-dotenv entry in your requirements.txt, if you added it) before pasting into the portal. The platform doesn't ship your .env; the env vars are already in the container. Code that tries to read a non-existent .env can silently no-op and leave the API key empty, which then throws an AuthenticationError on the first LLM call.

Never hardcode a key in your strategy.py; submissions are stored and re-runnable.

Extra pip dependencies?

The portal has a requirements textarea right under the code editor. Paste any additional pip packages your strategy needs there (one per line, same format as a normal requirements.txt). The platform builds a fresh container with those packages installed before running your code. Common ones (numpy, scipy, pandas, the OpenAI SDK) are already in the base image, so you don't need to list them.

Submission attempts are capped per scenario

Each scenario allows 3 submissions. The Gauntlet allows 1. The portal shows your remaining count before you submit. Spend them wisely; playtest locally first. (A run that TIMEOUTs does not consume an attempt. But it also returns no score.)