Welcome to the Watt-The-Hack Advanced Track docs. This is the complete reference for the grid components, financial constraints, scoring, event mechanics, and the API your controller talks to. Read the Quickstart first, then dip into whichever section you need.
One step: 15 simulated minutes
A scenario run: 288 steps (72 h / 3 days)
Your job: meet demand at the lowest cost
Every 15 simulated minutes the engine hands your controller a state dictionary (current demand, solar, price, battery charge, a short forecast, and any active alerts). You return an action dictionary saying how to dispatch the battery, diesel, solar curtailment and grid reserves. The engine applies physics + market rules, charges you the cost of that step, and loops. Your score is the total cost over the whole run; lower is better.
From zero to a scored submission in four moves. Each links to the section with the detail.
pip install "watt-the-hack[playtest]"
The same engine the judges run. Develop on your own machine with full Python: imports, LLM calls, anything.
controller(state) or a Strategy class with a step(self, state) method, returning an action dict. Start from a template in the Submission Portal.
python -m watt_the_hack.playtest my_controller.py \
    --scenario duck_curve --open-report
The HTML report tells you exactly where your money went.
The simulation models a single-node city grid. Your controller sits right at the center, balancing generation and demand.
state["price"]); any surplus is exported (at the export tariff). Both respect strict Import Caps (default 120 MW) and Export Caps (default 50 MW).
fcas_reserve_mw = how much inverter capacity you keep on call for the grid. You get paid $40/MW per hour just for holding it ready, whether or not it's ever used. It's a promise, not a discharge. Full mechanics in the FCAS explainer below.
Think of the inverter as the gateway between your battery and the rest of the grid. Even if your battery is completely full, it can't discharge faster than its inverter limit.
Additionally, the inverter limit is shared between active battery_flow_mw and passive fcas_reserve_mw. If you have a 50 MW inverter and you reserve 20 MW for FCAS, you can only charge or discharge your battery at a maximum rate of 30 MW for that timestep. Managing this trade-off is crucial to advanced optimization.
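To make the trade-off concrete, here is a minimal sketch of the headroom arithmetic. The helper name is made up for illustration and is not part of the engine's API; the 50 MW default is the inverter limit quoted above.

```python
def max_battery_flow_mw(fcas_reserve_mw, inverter_limit_mw=50.0):
    """Largest |battery_flow_mw| left once the FCAS reserve has taken its share.

    Hypothetical helper: mirrors the shared-inverter rule described above.
    """
    # A reserve can't be negative or exceed the inverter itself.
    reserve = min(max(fcas_reserve_mw, 0.0), inverter_limit_mw)
    return inverter_limit_mw - reserve


print(max_battery_flow_mw(20.0))  # 30.0
```

Calling it before you pick battery_flow_mw keeps your own maths in line with what the engine will clamp to anyway.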
What it is in real life: FCAS (Frequency Control Ancillary Services) is a real market run by AEMO, Australia's grid operator. The grid must stay at almost exactly 50 Hz; when a large generator trips, fast assets like batteries are paid to inject or soak up power within seconds to catch the frequency. Crucially, they're paid mostly for being on standby, not for the energy they end up moving. The Hornsdale "Tesla big battery" in South Australia earns a large share of its income from FCAS, not from buying and selling energy. This scenario models that market.
In the game it's a pure bid. Each step you set fcas_reserve_mw = how many MW you promise to keep available for the grid. You move no energy and pay nothing to bid. You're just claiming "I'm holding this much in reserve."
You're paid to hold it: $40 per MW per hour, whether or not it's ever called. Reserve 10 MW for an hour and you earn $400 for doing nothing but staying ready (per 15-minute step that's $10/MW).
It costs you inverter headroom. FCAS gets first claim on the inverter: |battery_flow_mw| + fcas_reserve_mw ≤ inverter limit (default 50 MW). Every MW you reserve is a MW you can't use for arbitrage that step. Changing your bid sharply between steps also costs a small ramp charge (~$500 per MW of change), so keep it steady.
Dispatch events test your claim. Sometimes the grid actually calls your reserve: a fcas_dispatch event asks for a number of MW over a window. You're warned ahead of time: read state["fcas_events_upcoming"] (each entry has at_step, end_step, magnitude_mw). When called you must actually deliver from stored charge, so a reserve bid is only as good as the battery SOC backing it.
⚠ Failing a dispatch is the harshest penalty in the game
So FCAS is reliable, near-free income. But only bid what you can truly back with charge when a dispatch lands, and keep some SOC in the tank ahead of the windows the engine warns you about.
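A rough sketch of both sides of that bargain: what a steady bid earns, and how much stored energy an upcoming dispatch would draw. Treat the details as assumptions: the helper names are hypothetical, the dispatch window is assumed to include both at_step and end_step, and capacity_mwh is a placeholder because the battery's energy capacity isn't stated here.

```python
STEP_HOURS = 0.25              # one step = 15 simulated minutes
FCAS_RATE_PER_MW_HOUR = 40.0   # availability pay quoted above


def availability_income(reserve_mw, steps):
    """Dollars earned for holding reserve_mw over a number of steps."""
    return reserve_mw * FCAS_RATE_PER_MW_HOUR * STEP_HOURS * steps


def dispatch_energy_mwh(event):
    """Energy one dispatch call would draw (ASSUMES an inclusive step window)."""
    steps = event["end_step"] - event["at_step"] + 1
    return event["magnitude_mw"] * steps * STEP_HOURS


def soc_needed(events, capacity_mwh):
    """Fraction of the battery needed to back every listed event.

    capacity_mwh is a placeholder: use your scenario's real value.
    """
    return sum(dispatch_energy_mwh(e) for e in events) / capacity_mwh


event = {"at_step": 152, "end_step": 154, "magnitude_mw": 18.0}
print(availability_income(10.0, 4))   # 400.0 (10 MW held for one hour)
print(dispatch_energy_mwh(event))     # 13.5
print(soc_needed([event], capacity_mwh=100.0))
```

Feeding state["fcas_events_upcoming"] into soc_needed gives a floor to hold SOC above before the window opens.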
Every step accrues cost. The sum over the run is your raw score (lower is better; a negative raw score means you turned a profit). Here's every lever that moves it.
state["price"]. The goal is simple: buy low, avoid buying high.
price goes negative, exporting costs you. See Negative Prices under Common Pitfalls.
state["grid_co2_intensity"]). Cleaner dispatch is cheaper dispatch.
curtail_solar or charge the battery to absorb the excess.
state["battery_throughput_remaining_mwh"]. Once depleted, the battery is locked. Spend cycles where they matter most.
Every line that can appear in your cost_breakdown, with the exact rate the engine applies. Positive numbers are costs; rows marked (income) are income. Most rows read $0 unless their mechanic is active in the scenario you're running - the bottom two groups only fire in the advanced scenarios.
| Line item | What it is | Rate |
|---|---|---|
| Core - charged in every scenario | | |
| tariff_import | Buying power from the external grid. | price × MWh imported - dynamic, read state["price"] ($/MWh) |
| tariff_export | Selling surplus solar or battery power back to the grid. | $50 / MWh exported (income) |
| demand_charge | Your single biggest import spike, billed once for the whole run (peak-shaving discipline). | $1,000 / MW of peak import |
| carbon_cost | Carbon price on CO₂ from grid imports + diesel. Scenarios can override grid intensity. | $50 / kg CO₂ · grid ≈ 0.7, diesel ≈ 0.27 kg/MWh |
| battery_wear | Wear from cycling the battery - charging or discharging. | $50 / MWh moved |
| ramp_charge | Smoothness penalty on the step-to-step change in net grid power. | (ΔMW)² × $1 |
| generator_fuel | Running the diesel generator. | $1,000 / MWh |
| blackout_penalty | Unmet demand (load shed). Avoid at all costs. | $100,000 / MWh |
| overvoltage_penalty | Exporting beyond the grid export cap. | $5,000 / MWh |
| FCAS - when the scenario enables the reserve market | | |
| fcas_revenue | Availability pay for capacity you keep on standby (whether or not it's called). | $40 / MW / hour reserved (income) |
| fcas_dispatch_bonus | Energy you actually deliver when a dispatch event calls your reserve. | $200 / MWh delivered (income) |
| fcas_shortfall_penalty | Failing to deliver a called dispatch - bid only what your SOC can back. | $100,000 / MWh short |
| fcas_ramp_charge | Volatility in your reserve bid between steps. | $500 / MW of change |
| Advanced scenarios only - $0 unless that mechanic is live | | |
| compliance_penalty | Breaching an Operator's-Mandate window you opted into (SOC floor / export cap). | $2,000,000 / SOC-unit · $500,000 / MW over cap, per step |
| diesel_ban_penalty | Running diesel during a ban with no valid agent_plan exemption. | $3,000 / MWh |
| anomaly_ack_fine | Each step inside an anomaly window you didn't acknowledge in agent_plan. | $5,000 / step |
| cyber_containment_fine | Missing a real attack, or acknowledging a fake one. | $50,000 each |
| ids_cost | Subscribing to the intrusion-detection signal. | flat per-step fee set by the scenario |
| phishing_fine | Acting on a phishing / bait directive. | fine set by the scenario |
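For a feel of how the per-MWh and per-MW rates turn into dollars on a single step, here is a back-of-envelope sketch of three core rows. It only restates the rates from the table; the function names are illustrative, and the engine's exact accounting (rounding, ordering, what counts as "net grid power") is an assumption here, so use the playtest report as the source of truth.

```python
STEP_HOURS = 0.25  # MW held for one step moves MW x 0.25 MWh


def tariff_import_cost(price, import_mw):
    """price ($/MWh) x energy imported this step (MWh)."""
    return price * import_mw * STEP_HOURS


def battery_wear_cost(battery_flow_mw):
    """$50 per MWh moved, charging or discharging."""
    return 50.0 * abs(battery_flow_mw) * STEP_HOURS


def ramp_charge(prev_net_mw, net_mw):
    """(delta MW)^2 x $1: tripling the jump costs nine times as much."""
    return (net_mw - prev_net_mw) ** 2 * 1.0


print(tariff_import_cost(120.0, 40.0))  # 1200.0
print(battery_wear_cost(-20.0))         # 250.0
print(ramp_charge(60.0, 70.0))          # 100.0
print(ramp_charge(60.0, 90.0))          # 900.0
```

The quadratic ramp row is the one that surprises people: two 10 MW moves cost $200 in total, one 20 MW move costs $400.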
There are two numbers, and they point in opposite directions. Don't confuse them.
The playtest report's final_score is the total dollars your run accrued. Lower wins. This is what you optimise against on your own machine.
Your raw cost is converted into leaderboard points. Higher wins. The conversion is anchored to two baselines for each scenario.
Each scenario has a naive baseline (a do-nothing-clever controller) and an optimal baseline (a strong reference). Your points are a linear interpolation between them:
points = 100 × (naive_cost − your_cost) / (naive_cost − optimal_cost)
points = clamp(points, 0, 150)
What to actually aim for: hitting 100 means you matched our strongest reference controller - that's an excellent score and the realistic target. The 100–150 band only opens up when you genuinely beat that reference, which is hard and uncommon - don't treat 150 as the expected result. A strong submission lands near 100.
Your headline total is the sum of your per-scenario points, with one twist: the Gauntlet counts triple (×3). With five Phase-1 scenarios plus the Gauntlet, a flawless run is 5 × 100 + (100 × 3) = 800 points, of which the Gauntlet is ≈ 37.5%. No other single scenario comes close. It is by far the highest-leverage thing you can get right. See The Gauntlet.
Say a scenario's naive baseline is $1,000,000 and its optimal is $400,000 (a $600k "moat"). Your controller runs at $520,000:
points = 100 × (1,000,000 − 520,000) / (1,000,000 − 400,000)
       = 100 × 480,000 / 600,000 = 80 points
Shave another $120k off (to $400k) and you hit 100. Every dollar closer to optimal is worth more points when the moat is narrow. So the scenarios with the widest naive→optimal gap are where effort pays off most. The playtest report shows your standing on this ladder after every run.
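If you'd like to check your own runs against the ladder, the two formulas above fit in a few lines. This is a sketch of the arithmetic as described, not the leaderboard's actual code; the function names are made up, and the baselines are the example figures from this section.

```python
def leaderboard_points(your_cost, naive_cost, optimal_cost):
    """Linear interpolation between the two baselines, clamped to [0, 150]."""
    raw = 100.0 * (naive_cost - your_cost) / (naive_cost - optimal_cost)
    return max(0.0, min(150.0, raw))


def headline_total(phase1_points, gauntlet_points, gauntlet_weight=3):
    """Sum of per-scenario points, with the Gauntlet counted triple."""
    return sum(phase1_points) + gauntlet_weight * gauntlet_points


print(leaderboard_points(520_000, 1_000_000, 400_000))  # 80.0
print(leaderboard_points(400_000, 1_000_000, 400_000))  # 100.0
print(headline_total([100] * 5, 100))                   # 800
```

In this example each point is worth $6,000 of cost (the $600k moat divided by 100), which is a quick way to judge whether a tweak is worth chasing.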
Understanding these concepts will save you from making costly mistakes in the simulation.
10 MW × 0.25h = 2.5 MWh of your battery capacity.
battery_flow_mw = 300, and the engine simply clamps it to the limit (~50 MW). It won't error, but you also won't get 300 MW. See "out-of-range values" under Controller Basics.
price is negative (e.g., -$20/MWh), the grid is oversupplied. Because import and export are automatic, this means you get paid to import (charge your battery) and you are charged to export. Curtail your solar so you aren't paying the grid to take your surplus!
battery_flow_mw, positive means discharging into the grid (providing power), while negative means charging from the grid (consuming power).
Your controller is just a Python function that runs every timestep. It takes in a state dictionary and returns an action dictionary.
{
    "time": 42,        # Timestep index (0..287)
    "demand": 145.2,   # Current city demand (MW) - this step only
    "solar": 80.5,     # Current solar generation (MW) - this step only
    "soc": 0.45,       # Battery state of charge (0..1) - this step only
    "price": 120.0,    # Current import tariff ($/MWh)
    "features": {      # Which mechanics are live this scenario
        "battery": True, "fcas": True, "ids": True
    },
    "forecast": {      # Lookahead arrays (≈16 steps). NOISY - see Tips.
        "demand": [146.1, 150.2, ...],
        "solar": [82.1, 85.0, ...],
        "price": [125.0, 150.0, ...]
    },
    "alerts": [        # Narrative events active RIGHT NOW (see Reacting to Events)
        {"id": "ids_w1", "type": "qualitative_alert", "severity": "critical",
         "title": "IDS Alert wave 1", "description": "SECURITY BREACH ...",
         "at_step": 30, "end_step": 42}
    ],
    "fcas_events_upcoming": [   # Scheduled FCAS dispatch calls (pre-position SOC!)
        {"at_step": 152, "end_step": 154, "magnitude_mw": 18.0}
    ],
    "ids_signal_node_a": 0.87,  # Attack-probability hint [0..1] - only if you subscribed,
    "ids_signal_node_b": 0.64,  # otherwise None (see Cyber & Phishing)
    "battery_throughput_remaining_mwh": 500.0,  # If a throughput budget is active
    "peak_import_mw": 92.0,     # Your biggest grid import so far (drives the demand charge)
    "agent_plan": { ... }       # Whatever your plan()/replan() returned (see Agentic)
}
You control the grid for the upcoming timestep by returning a dictionary with any of these keys. If you leave a key out, it just defaults to 0.
def controller(state):
    # Your logic here...
    return {
        # Positive = discharge to grid, Negative = charge from grid.
        # Bounded by: inverter limit, SOC, and throughput budget.
        "battery_flow_mw": 10.5,
        # MW of emergency diesel generation to dispatch [0, max_limit]
        "emergency_generator": 0.0,
        # MW of solar to intentionally disconnect (prevents overvoltage)
        "curtail_solar": 0.0,
        # MW of inverter capacity to lock for frequency control.
        # This capacity cannot be used for charging/discharging this step!
        "fcas_reserve_mw": 5.0,
        # Pay for the IDS probability signal THIS step (cyber scenarios).
        # Populates state["ids_signal_node_a"/"_b"] on the NEXT step.
        "subscribe_ids": False,
        # Per-step acknowledgements the engine reads from YOUR ACTION.
        # (containment_ack / anomaly_ack - see Cyber & Phishing.)
        "agent_plan": {},
    }
You don't need to pre-validate your numbers. The engine reads each value, treats it as a number, and clamps it to what's physically possible that step. It does not error or penalise you just for asking for too much:
battery_flow_mw → clamped to ±the inverter limit (default 50 MW, minus any FCAS reserve), then further to what your state of charge can actually deliver or absorb. So battery_flow_mw = 300 just becomes a ~50 MW full-power discharge. Not an error.
curtail_solar → clamped to [0, current solar].
emergency_generator → clamped to [0, its MW limit].
fcas_reserve_mw → clamped to [0, inverter limit] (and it shares the inverter with battery flow).
So always return plain numbers. A value the engine can't convert to a number (None or non-numeric text) can fail the whole evaluation. And if your step() throws, or returns something that isn't a dict, that step is replaced with the zero action (do nothing) and logged as a controller error. One buggy step won't crash the run, but its intended action is lost.
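Since a non-numeric value is the one thing the engine won't forgive, a small guard on your own side can be worth having. The sketch below is one way to do it, assuming you want bad values to fall back to 0.0; as_number and sanitize_action are hypothetical helpers, not engine functions.

```python
import math

# The four numeric action keys described above.
NUMERIC_KEYS = ("battery_flow_mw", "emergency_generator",
                "curtail_solar", "fcas_reserve_mw")


def as_number(value, default=0.0):
    """Coerce value to a finite float, falling back to default."""
    try:
        x = float(value)
    except (TypeError, ValueError):   # None, non-numeric text, lists, ...
        return default
    return x if math.isfinite(x) else default   # reject NaN / inf too


def sanitize_action(action):
    """Return a copy of action with every numeric key forced to a plain float."""
    clean = dict(action)
    for key in NUMERIC_KEYS:
        if key in clean:
            clean[key] = as_number(clean[key])
    return clean


print(sanitize_action({"battery_flow_mw": None, "fcas_reserve_mw": 5}))
# {'battery_flow_mw': 0.0, 'fcas_reserve_mw': 5.0}
```

Wrapping your return value as sanitize_action({...}) leaves non-numeric keys such as subscribe_ids and agent_plan untouched.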
fcas_reserve_mw reduces your available bandwidth for normal flow.
New to Python, or unsure what the engine does between steps? This is the stuff that quietly breaks first-time controllers. Read it before you fight a bug that isn't really a bug.
The engine imports your file once, then calls your controller every 15 simulated minutes for the whole run. It never restarts in between. So a value you compute inside step() / controller() and keep in a local variable is thrown away the instant the function returns. Locals do not survive to the next step. To remember anything, you need state that lives outside a single call.
This is the most common sticking point: you want a list you can update every step (a price history, a rolling error buffer, a counter). There are two correct ways.
class MyStrategy:  # any name works - the portal detects it
    def __init__(self):
        # Runs ONCE, when the engine first creates your strategy.
        # Put anything you want to remember for the whole run here.
        self.price_history = []  # this list lives for the entire run
    def step(self, state):
        # self.price_history is the SAME list every step - append to it,
        # read it, modify it, and it is still there on the next step.
        self.price_history.append(state["price"])
        recent = self.price_history[-12:]  # last 3 hours (12 x 15min)
        avg = sum(recent) / len(recent)
        flow = 20.0 if state["price"] > avg else -20.0
        return {"battery_flow_mw": flow}
The engine builds your strategy class once and reuses that one instance for every step, so everything stored on self persists automatically. This is the clean way, and it sidesteps the gotcha below.
price_history = []  # defined at the TOP of the file = "module level"
def controller(state):
    price_history.append(state["price"])  # appending is fine, no keyword needed
    recent = price_history[-12:]
    avg = sum(recent) / len(recent)
    return {"battery_flow_mw": 20.0 if state["price"] > avg else -20.0}
This works because your file is imported once, so the module-level price_history is shared across every call.
.append(), my_list[i] = ..., my_dict[k] = ...) works with no ceremony. But if you reassign the name itself inside a function, Python treats that name as a new local instead, so the module-level value never updates (and a read-then-assign such as counter = counter + 1 raises UnboundLocalError, which the engine logs as a controller error):
counter = 0
def controller(state):
    global counter         # WITHOUT this line, counter below is a NEW local...
    counter = counter + 1  # ...so this raises UnboundLocalError and the module-level counter never changes.
    return {"battery_flow_mw": 0.0}
counter = ..., x = x + 1) needs global. The class-based approach avoids this trap entirely. Just write self.counter += 1.
step() / controller(). Reset on every step.
state dict you're handed (e.g. state["my_thing"] = ...). The engine rebuilds that view fresh each step, so they're discarded. Use self.* or a module-level variable instead.
The one channel the engine deliberately carries forward is state["agent_plan"], which accumulates whatever your plan() / replan() returned and whatever you return under the agent_plan key from step() (see Reacting to Events).
self.x) with functions (called methods). The engine makes one instance and calls its step each tick.
self (def step(self, state):). Forget it and you get TypeError: step() takes 1 positional argument but 2 were given.
self.. A bare helper() raises NameError.
self.).
__init__(self) must work with no arguments (don't add required parameters).
def clamp(x, lo, hi):  # a plain module-level helper
    return max(lo, min(hi, x))
class MyStrategy:  # any name works
    def __init__(self):
        self.target_soc = 0.5
    def _decide_target(self, price):  # a helper METHOD (leading _ is just convention)
        # state["price"] is in $/MWh, so the threshold is too (illustrative value)
        return 0.8 if price < 100.0 else 0.3
    def step(self, state):
        self.target_soc = self._decide_target(state["price"])  # call a method via self.
        raw = (self.target_soc - state["soc"]) * 100
        return {"battery_flow_mw": clamp(raw, -50.0, 50.0)}  # call a function directly
def f(x, hist=[]): creates that list once and reuses it across calls. Use hist=None, then if hist is None: hist = [] inside.
1 / 2 is 0.5 (true division); 1 // 2 is 0 (floor). Mixing them up skews your maths.
IndentationError.
The public Python engine is identical to the one the judges run. Iterating locally, with the HTML report open, is the single fastest way to climb the leaderboard. This is also where you find out why you lost points.
pip install "watt-the-hack[playtest]"
pip install --upgrade "watt-the-hack[playtest]" to pull the latest content.
--open-report pops the HTML report in your browser when it finishes.
python -m watt_the_hack.playtest my_controller.py --scenario duck_curve --open-report
python -m watt_the_hack.playtest --list-scenarios
"I scored badly" is not a diagnosis. Every run writes a folder under runs/<scenario>_<timestamp>/ containing a report.html plus raw metrics.json and per-step steps.csv. The report breaks down precisely which line items cost you:
If a penalty line you didn't expect is non-zero (e.g. cyber_containment_fine, diesel_ban_penalty, fcas_shortfall_penalty), that's a mechanic you mis-handled. Jump to the matching section below.
Prefer raw numbers? Run a controller and print the breakdown yourself:
from watt_the_hack.playtest import run_controller  # convenience wrapper
result = run_controller("my_controller.py", scenario="gauntlet")
print(result["metrics"]["final_score"])  # raw cost (lower wins)
for k, v in sorted(result["cost_breakdown"].items(),
                   key=lambda kv: -abs(kv[1])):
    print(f"{k:<28} {v:>14,.0f}")  # where every dollar went
From Frequency Frenzy onward, scenarios fire events: weather notes, demand spikes, operator briefs, compliance windows, cyberattacks. Here's how they reach your controller and how to respond without blowing your time budget.
state["alerts"]: the list of narrative events active right now. Each has id, type, severity, title, description (the prose), at_step, end_step. Read it every step.
replan(self, state, alerts): an optional hook that fires whenever alerts are active. The right place for a slow parse (e.g. an LLM call). Its return value is merged into the persistent agent_plan.
state["agent_plan"]: your standing memo to the engine. It persists across steps and is how you respond to enforcement events (acknowledge an attack, file an exemption).
Narrative events (qualitative alerts, weather, demand/price signals, forecast-bias notices) appear in state["alerts"]. But the engine deliberately hides the structured enforcement windows (compliance windows, the diesel ban, cyberattack windows, phishing traps) and strips their numeric payload. That's the whole challenge of the advanced scenarios: the prose brief tells you a rule is coming ("cap exports to 22 MW over steps 108–124"), and you must parse it into the right action. The engine won't hand you the number.
replan fires every active step
replan is called on every step where at least one alert is active, not once per alert. An alert that spans steps 8–20 calls replan 13 times. Across a full run that can be 100–200 calls. If you fire an LLM request on every one, you'll blow the ~14-minute evaluation budget and your run ends in TIMEOUT with no score.
The fix is one line: dedupe by alert id. Track which ids you've already handled and only do expensive work for genuinely new ones. The number of distinct alerts is small (see the table), so a deduped controller makes only a handful of LLM calls.
class Strategy:
    def __init__(self):
        self.seen = set()
        self.constraints = []  # parsed rules live here for step() to read
    def replan(self, state, alerts):
        new = [a for a in alerts if a["id"] not in self.seen]
        if not new:
            return {}  # nothing new -> do NO work, return immediately
        for a in new:
            self.seen.add(a["id"])
            self.constraints.append(self._parse(a))  # regex or ONE batched LLM call
        return {}  # (you can also return dict updates for agent_plan)
    def _parse(self, alert):
        # turn alert["description"] prose into a structured rule on self
        ...
Counts for the scored (judging) run of the scenarios where an LLM actually helps. The earlier ones (Duck Curve, Frequency Frenzy, AI Grid Shock) don't need one. Budget your LLM calls against distinct alerts (what you handle when you dedupe), not the raw replan firings.
| Scenario | Distinct alerts (your LLM-call budget) | replan() firings if you don't dedupe |
|---|---|---|
| The Operator's Mandate | 13 | ~73 |
| Cybersecurity | 11 | ~92 |
| The Gauntlet | 18 | ~119 |
The middle column is the number of unique alerts. Dedupe by id and that's how many times you do real work. The right column is how often replan() is called if you don't dedupe: firing an LLM on each of those will time you out. Figures are for the scored (judging) runs and may shift slightly as scenarios are tuned; treat them as ceilings, not contracts.
agent_plan is one persistent dictionary
Everything you write to agent_plan accumulates into a single plan that persists for the rest of the run, and the engine reads it back. Anything you return under the agent_plan key of your step() action is merged into that same plan. So you don't have to memorise which key goes where. The clean habit:
Per-step acknowledgements → from step()
containment_ack / anomaly_ack are a live signal. Set them in the agent_plan you return from step(), every step the incident is live.
One-time policy → from plan() / replan()
emergency_exemption and your parsed constraints are a standing document you file once. The natural home for anything you parse with an LLM (see the budget rule).
Both land in the same persistent agent_plan; pick the hook that reads cleanly. The worked examples in Cyber & Phishing show both.
These scenarios attack your inputs. The Cybersecurity scenario runs anomaly windows you must detect and acknowledge; the Gauntlet escalates with a subscribable IDS signal plus real and decoy attack waves. Some incidents are real and must be contained; some are decoys you must not react to; some prose is bait designed to make you sabotage yourself.
subscribe_ids: True in your action to buy an intrusion hint (small per-step fee). Next step you can read two probabilities, ids_signal_node_a and ids_signal_node_b ∈ [0, 1]. If you don't subscribe, both are None.
demand, solar, soc) but the forecast still reads the true series. A large, sudden gap between a sensor and its forecast is a tell that your data is being spoofed.
In raw numbers a real step-change and a spoof look identical - the operator's prose is what tells you which channel to trust during each anomaly window:
Acknowledge the window with agent_plan["anomaly_ack"] = <anomaly_id> while it is live, and dispatch on the value you chose to trust:
import re
class Strategy:
    def __init__(self):
        self.windows = []  # anomaly windows parsed from the alert prose
        self.seen = set()
    def replan(self, state, alerts):
        for a in alerts:
            if a["id"] in self.seen:
                continue
            self.seen.add(a["id"])
            text = (a.get("title", "") + " " + a.get("description", "")).lower()
            if "anomaly" not in text:
                continue
            m = re.search(r"anom[-_][a-z0-9]+", text)  # the id to acknowledge
            if "forecast" in text and "compromis" in text:
                trust = "sensor"    # forecast poisoned -> believe the meter
            elif "does not corroborate" in text or "spoof" in text:
                trust = "forecast"  # meter spoofed -> believe the forecast
            else:
                trust = "sensor"    # neutral / real event -> the live meter
            self.windows.append({"id": m.group(0) if m else a["id"], "trust": trust,
                                 "start": a.get("at_step"), "end": a.get("end_step")})
        return {}
    def step(self, state):
        t = int(state["time"])
        sensor = state["demand"]
        fc = (state.get("forecast", {}) or {}).get("demand") or [sensor]
        agent_plan, demand = {}, sensor
        for w in self.windows:
            if w["start"] is None or w["end"] is None:
                continue  # alert carried no usable window
            if w["start"] <= t <= w["end"]:
                agent_plan["anomaly_ack"] = w["id"]  # ack while the window is live
                demand = fc[0] if w["trust"] == "forecast" else sensor
        net = demand - state["solar"]
        flow = max(-50.0, min(50.0, net)) if state["soc"] > 0.1 else 0.0
        return {"battery_flow_mw": flow, "agent_plan": agent_plan}
Everything below is the Gauntlet's escalation - a subscribable IDS signal and real/decoy attack waves with containment_ack. The standalone Cybersecurity scenario uses the anomaly mechanic above, not IDS.
The two IDS nodes are designed so you can't trust either alone:
So the rule is: only treat it as real when BOTH nodes clear a threshold (≈ 0.4 each). The signal is noisy, so smooth it over a couple of steps rather than trusting a single reading.
During a real attack window you must set agent_plan["containment_ack"] = <the attack id> (the critical alert names it, e.g. "attack_30"). Both mistakes are punished equally hard, about $50,000 per step:
Remember the channel: containment_ack is read from the action your step() returns, so set it there, every step the confirmed attack is live.
import re
class Strategy:
    def __init__(self):
        self.seen = set()
        self.attacks = {}  # attack_id -> (start_step, end_step), learned from prose
        self.ema_a = 0.0   # smoothed IDS node A
        self.ema_b = 0.0   # smoothed IDS node B
    def replan(self, state, alerts):
        # Parse each NEW alert once. A critical alert names the attack id to ack;
        # a decoy alert names an id but tells you NOT to ack it.
        for a in alerts:
            if a["id"] in self.seen:
                continue
            self.seen.add(a["id"])
            desc = a.get("description", "")
            m = re.search(r"`containment_ack`:\s*`([^`]+)`", desc)
            if m and a.get("severity") == "critical" and "decoy" not in desc.lower():
                self.attacks[m.group(1)] = (a.get("at_step"), a.get("end_step"))
        return {}
    def _live(self, t):
        # Attack ids whose window covers step t (windows missing a bound are skipped).
        return [aid for aid, (s, e) in self.attacks.items()
                if s is not None and e is not None and s <= t <= e]
    def step(self, state):
        t = int(state["time"])
        # 1) Smooth the (noisy) IDS signal. Subscribe whenever an attack window is open.
        live = self._live(t)
        in_window = bool(live)
        a = state.get("ids_signal_node_a") or 0.0
        b = state.get("ids_signal_node_b") or 0.0
        self.ema_a = 0.5 * self.ema_a + 0.5 * a
        self.ema_b = 0.5 * self.ema_b + 0.5 * b
        corroborated = self.ema_a >= 0.4 and self.ema_b >= 0.4  # BOTH nodes agree -> real
        # 2) Acknowledge ONLY a confirmed real attack, using the id from the prose.
        agent_plan = {}
        if corroborated:
            for aid in live:
                agent_plan["containment_ack"] = aid
        # 3) When data looks spoofed, steer by the forecast instead of the live sensor.
        demand = state["demand"]
        solar = state["solar"]
        fc = state.get("forecast", {})
        if corroborated and fc.get("demand") and fc.get("solar"):
            demand, solar = fc["demand"][0], fc["solar"][0]
        net = demand - solar
        flow = max(-50.0, min(50.0, net)) if state["soc"] > 0.1 else 0.0
        return {
            "battery_flow_mw": flow,
            "subscribe_ids": in_window,  # only pay for IDS while a window is open
            "agent_plan": agent_plan,    # containment_ack read from HERE
        }
Each brief is a constraint in plain English - a reserve floor, an export cap - with a step window. Parse them once in replan(), enforce the active ones in step(). The traps this scenario is built around: fractional words ("four-fifths of capacity"), rescissions ("the prior directive is RESCINDED"), and life-safety overrides that win an overlap.
import re
MIN_SOC = re.compile(r">=\s*(\d+)\s*%.*?steps?\s+(\d+)\s+through\s+(\d+)", re.I | re.S)
EXPORT = re.compile(r"export.{0,40}?(?:cap|limit|reduced).{0,80}?(\d+)\s*MW.*?steps?\s+(\d+)\s+through\s+(\d+)", re.I | re.S)
class Strategy:
    def __init__(self):
        self.rules, self.seen = [], set()
    def replan(self, state, alerts):
        for a in alerts:
            if a["id"] in self.seen:
                continue
            self.seen.add(a["id"])
            text = a.get("title", "") + "\n" + a.get("description", "")
            pr = 100 if "life-safety" in text.lower() else 50  # life-safety wins overlaps
            m = MIN_SOC.search(text)
            if m:
                self.rules.append({"min_soc": int(m.group(1)) / 100,
                                   "start": int(m.group(2)), "end": int(m.group(3)), "pr": pr})
            m = EXPORT.search(text)
            if m:
                self.rules.append({"max_export": float(m.group(1)),
                                   "start": int(m.group(2)), "end": int(m.group(3)), "pr": pr})
        return {}
    def step(self, state):
        t = int(state["time"])
        active = [c for c in self.rules if c["start"] <= t <= c["end"]]
        if active:  # only the top-priority tier applies
            top = max(c["pr"] for c in active)
            active = [c for c in active if c["pr"] == top]
        # Strictest active rule of each kind (None when no such rule is active).
        min_soc = max((c["min_soc"] for c in active if "min_soc" in c), default=None)
        max_export = min((c["max_export"] for c in active if "max_export" in c), default=None)
        # ...then hold SOC >= min_soc and cap exports at max_export this step.
        return {"battery_flow_mw": 0.0}
Regex catches the clean AEMO-style phrasings; the fractional-word and rescission cases are where an LLM parse pulls ahead - that cost gap is the lesson of this scenario.
During an environmental diesel-ban window, running diesel without a filed exemption costs $3,000/MWh. To run it legally you file a small document. It's one-time policy, so plan() / replan() is the natural home (it persists for the run):
def replan(self, state, alerts):
    # File the exemption once. The id is named in the EPA brief, e.g. "epa_g77_day3".
    return {
        "emergency_exemption": {
            "directive_id": "epa_g77_day3",
            # reason must be substantive: >= 60 chars, contain a number AND an
            # operational keyword (mw / soc / demand / deficit / import / capacity ...).
            "reason": ("Confirmed import deficit of 28 MW above the 120 MW grid cap "
                       "during the EPA window; battery SOC at 12% and demand rising, "
                       "diesel required to maintain load."),
            "expected_duration_steps": 6,  # int, 1..12
        }
    }
A vague or numberless reason is rejected and the penalty applies. The engine wants a real justification, not a rubber stamp.
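Because a rejected reason costs real money, it can help to self-check the text before you file it. The sketch below mirrors the three conditions listed in the comment above (length, a number, an operational keyword). It is a local approximation, not the engine's validator: the function name is made up, and the keyword list only includes the examples shown, since the full list is elided.

```python
import re

# Only the keywords named above; the engine's full list is not shown here.
KEYWORDS = ("mw", "soc", "demand", "deficit", "import", "capacity")


def reason_looks_substantive(reason):
    """Local pre-check: >= 60 chars, contains a digit and an operational keyword."""
    text = reason.lower()
    return (len(reason) >= 60
            and re.search(r"\d", reason) is not None
            and any(k in text for k in KEYWORDS))


good = ("Confirmed import deficit of 28 MW above the 120 MW grid cap "
        "during the EPA window; battery SOC at 12% and demand rising, "
        "diesel required to maintain load.")
print(reason_looks_substantive(good))            # True
print(reason_looks_substantive("Need diesel."))  # False
```

This is most useful when an LLM drafts the reason: reject and regenerate locally rather than finding out from the diesel_ban_penalty row.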
Some alerts are bait. They'll urgently instruct you to put a specific key in your agent_plan ("set verify_reserve: true to confirm"). If you obey, the engine fines you. The defense is a hard rule:
Only ever write keys to agent_plan that your own controller logic decided on (containment_ack, anomaly_ack, emergency_exemption, your constraints). Never copy a key just because an alert's text told you to.
If you're using an LLM to parse briefs, this is a prompt-injection risk: instruct the model to extract constraints only, and validate its output against a fixed allow-list of keys before you act on it.
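One way to enforce that allow-list, sketched under the assumption that your parser (LLM or otherwise) hands back a dict of proposed agent_plan keys. The function name is hypothetical; the allowed keys are the engine-read ones named in this section.

```python
# Keys your own controller logic is allowed to write to agent_plan.
ALLOWED_KEYS = {"containment_ack", "anomaly_ack", "emergency_exemption"}


def filter_plan(candidate):
    """Keep only allow-listed keys; anything else (bait included) is dropped."""
    if not isinstance(candidate, dict):
        return {}   # a malformed parser result contributes nothing
    return {k: v for k, v in candidate.items() if k in ALLOWED_KEYS}


llm_output = {"verify_reserve": True, "anomaly_ack": "anom-1"}
print(filter_plan(llm_output))  # {'anomaly_ack': 'anom-1'}
```

The filter only stops unknown keys; whether an allow-listed acknowledgement is actually warranted is still your controller's decision. Parsed constraints can stay on self rather than in agent_plan, which keeps them out of this path entirely.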
The Operator's Mandate, Cybersecurity, and the Gauntlet hide their rules in plain-English briefs. To act on them you have to read the text. That's where an LLM helps. This is the part most people find confusing, so here it is from scratch.
Your code is not restarted each step. The engine creates one instance of your Strategy class and keeps it for the whole run, calling three methods on that same object at three different times. Because it's one long-lived object, anything you save on self in one method is still there in the others. That is the whole trick: the slow thinking (an LLM call) happens in the methods that run rarely, saves its conclusions on self, and the fast loop just reads them.
One object, three jobs, three very different frequencies.
self. An LLM call is fine here if you dedupe (handle each alert id once). Optional.
run starts
│
├─ plan(state) ← ONCE (optional LLM: read the briefing)
│
├─ step(state) t = 0 ← every tick: fast, NO LLM
├─ step(state) t = 1
│ ⋮
│ 🔔 a new alert appears around t = 30
├─ replan(state, alerts) ← fires because an alert is active
│ (optional LLM: parse it, save to self)
├─ step(state) t = 30 ← reads what replan just saved on self
├─ step(state) t = 31
│ ⋮
└─ step(state) t = 287 ← run ends
Two things beginners trip on: replan fires on every step an alert is active (not once per alert, hence the dedupe; see Reacting to Events), and step runs every tick whether or not an alert is present.
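To see why the dedupe matters, here is a toy loop you can run locally. It is a hypothetical stand-in for the engine, not the real scheduler: the alert id "alert-1", the state key "t", and the 30..40 alert window are all made up for illustration.

```python
class CountingStrategy:
    """Toy strategy that counts how often the expensive parse would run."""

    def __init__(self):
        self.handled = set()   # alert ids already parsed
        self.replan_calls = 0
        self.slow_calls = 0    # stand-in for LLM calls

    def replan(self, state, alerts):
        self.replan_calls += 1
        for alert in alerts:
            if alert["id"] in self.handled:
                continue       # seen before: skip the slow work
            self.handled.add(alert["id"])
            self.slow_calls += 1   # a real controller would call the LLM here
        return {}

    def step(self, state):
        return {"battery_flow_mw": 0.0}


def run_toy_loop(strategy, n_steps=288, alert_start=30, alert_end=40):
    """Hypothetical stand-in for the engine loop (NOT the real engine)."""
    for t in range(n_steps):
        active = alert_start <= t <= alert_end
        alerts = [{"id": "alert-1"}] if active else []
        state = {"t": t, "alerts": alerts}
        if alerts:
            strategy.replan(state, alerts)   # fires on every active step
        strategy.step(state)                 # fires on every step
    return strategy


s = run_toy_loop(CountingStrategy())
print(s.replan_calls, s.slow_calls)  # prints: 11 1
```

One alert that stays active for 11 steps triggers replan 11 times, but the slow path runs once. Without the handled set, it would run 11 times.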
Two channels. And as a beginner you mostly need the first:
self.something. Your own notebook. Whatever you assign to self in plan / replan is readable in step. This is how an LLM's decision reaches the fast loop. Use it for almost everything.
agent_plan. A note to the engine. Only for the handful of keys the engine itself reads (containment_ack, emergency_exemption, anomaly_ack). Return it from any method; it persists and also shows up as state["agent_plan"] (see Reacting to Events & Cyber).
plan / replan, never step
Your entire evaluation (every timestep of the run) must finish within ~14 minutes of wall-clock. There is no per-step rescue: if the run as a whole exceeds the budget it ends in TIMEOUT with no score. (A timeout is a free retry: it doesn't burn one of your attempts, but you still get nothing back.) So the LLM has to live where it's called rarely:
plan(initial_state) runs once before step 0: the right place for an LLM call to read the scenario briefing and pick a high-level policy.
replan(state, alerts) runs whenever alerts are active, so dedupe by alert id (see Reacting to Events) and only call the LLM for new ones. The dict you return is merged into the persistent state["agent_plan"].
Never call an LLM in step(state). It runs every 15 simulated minutes; a network call there is multiplied across the whole run and will blow the budget. Instead read state.get("agent_plan", {}) and branch on the cached policy: LLM-quality decisions at deterministic-controller latency.
Use gpt-5.4-nano or gpt-5.4-mini (see the OpenAI section). Even in plan/replan, a slow model called repeatedly can run you over.
The LLM "thinks" in plan and replan and writes its conclusions onto self; step just reads them and dispatches. Trace self.stance and self.export_cap through the three methods:
class MyStrategy:  # any class name works
    # The engine builds this ONCE and reuses it for the whole run.
    def __init__(self):
        # self.* is shared memory across plan / replan / step.
        self.stance = "balanced"  # plan() will set this
        self.export_cap = None    # replan() will set this from a brief
        self.handled = set()      # alert ids already parsed (dedupe!)

    # ── ONCE, before t=0. A slow LLM call is fine here. ──────────────
    def plan(self, state):
        briefs = state.get("alerts", [])
        # ask_llm_* are YOUR helpers that call OpenAI (see the OpenAI section).
        self.stance = ask_llm_for_stance(briefs) or "balanced"  # save on self
        return {}  # nothing for the engine yet

    # ── while an alert is active. DEDUPE, then (maybe) call the LLM. ──
    def replan(self, state, alerts):
        for a in alerts:
            if a["id"] in self.handled:  # already handled ->
                continue                 # do NO work (this is what saves you)
            self.handled.add(a["id"])
            cap = parse_export_cap_with_llm(a.get("description", ""))  # 22.0 or None
            if cap is not None:
                self.export_cap = cap  # save on self for step() to use
        return {}  # everything we need is on self

    # ── EVERY 15 min (288x). Fast. NO LLM. Use what we prepared. ─────
    def step(self, state):
        demand, solar, soc = state["demand"], state["solar"], state["soc"]
        net = demand - solar
        flow = max(-50.0, min(50.0, net)) if soc > 0.1 else 0.0
        if self.stance == "conserve":  # <- decided by the LLM in plan()
            flow = min(flow, 0.0)      # hold charge; don't discharge hard
        curtail = 0.0
        if self.export_cap is not None:  # <- parsed by the LLM in replan()
            export = max(0.0, -(net - flow))
            curtail = max(0.0, export - self.export_cap)
        return {"battery_flow_mw": flow, "curtail_solar": curtail}
ask_llm_for_stance and parse_export_cap_with_llm are functions you write that call the OpenAI API (see the OpenAI section). Make them fail soft: if the key is missing or the call errors, return a sensible default so a network blip never crashes the run.
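One way to get the fail-soft behaviour is a small decorator that swaps any exception for a default. This is a sketch under assumptions: fail_soft is a hypothetical helper, it catches every Exception (broad on purpose, since a crash costs the whole run), and the body of ask_llm_for_stance below is a placeholder that simulates a failure rather than a real OpenAI call.

```python
import functools


def fail_soft(default):
    """Decorator: return `default` instead of raising if the wrapped call fails."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                result = func(*args, **kwargs)
            except Exception:
                return default  # network blip, missing key, bad JSON, ...
            # Treat an empty answer like a failure and use the default.
            return default if result is None else result
        return wrapper
    return decorator


@fail_soft(default="balanced")
def ask_llm_for_stance(briefs):
    # Placeholder body: a real helper would call the OpenAI API here.
    raise RuntimeError("simulated network failure")


print(ask_llm_for_stance([]))  # prints: balanced
```

For a parser whose "nothing found" answer is None, use @fail_soft(default=None) so an error and an empty result look the same to replan.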
Nothing forces you to use an LLM. Swap the *_with_llm helpers for plain string matching / regex on a["description"], or skip plan and replan entirely and write a pure step controller. The LLM just makes the wordier briefs easier to parse. It's a tool, not a requirement.
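As a sketch of the no-LLM route, here is a regex version of the export-cap parser. It assumes the brief mentions "export" and then a number followed by "MW" in the same sentence; real briefs may be worded differently, so test it against the alert text you actually see. parse_export_cap_with_regex is a hypothetical name.

```python
import re

# Assumption: "export ... <number> MW" within one sentence (no full stop
# between the word and the number). Case-insensitive.
_EXPORT_CAP = re.compile(r"export[^.]*?(\d+(?:\.\d+)?)\s*MW", re.IGNORECASE)


def parse_export_cap_with_regex(description):
    """Return the export cap in MW as a float, or None if no cap is found."""
    match = _EXPORT_CAP.search(description or "")
    return float(match.group(1)) if match else None


print(parse_export_cap_with_regex("Export cap of 22 MW applies from 14:00."))
# prints: 22.0
```

It has the same return contract as the LLM helper (a float or None), so it drops into replan without any other change.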
The finale. A single 288-step (3-day) run that combines every mechanic from the earlier scenarios at once. You get one submission, and it counts ×3 on the leaderboard. By far the highest-leverage scenario in the event.
Every challenge in the Gauntlet is something you already practised in an earlier scenario, where you have unlimited local playtests. The single-submission limit applies to the scored run, not to your iteration: master each mechanic in its own scenario first, then the Gauntlet is "just" doing all of them in one controller.
| In the Gauntlet you'll face… | Practise it in |
|---|---|
| A severe duck curve + battery throughput budget | Duck Curve |
| A noisy AR(1) forecast to debias and plan against | Frequency Frenzy |
| FCAS reserve bids + scheduled dispatch calls | AI Grid Shock |
| Prose briefs → compliance windows (SOC floors, export caps) | Operator's Mandate |
| An EPA diesel-ban window needing an exemption | Operator's Mandate |
| A phishing / bait directive to ignore | Operator's Mandate |
| Anomaly windows to detect + acknowledge | Cybersecurity |
| Subscribable IDS signal + real/decoy attack waves | new in the Gauntlet |
The Gauntlet rephrases its briefs and can shuffle the timing of incidents between runs, and the scored run is graded fresh. A controller that hard-codes "ack at step 30" is brittle; one that detects conditions (corroborated IDS nodes, a forecast-vs-sensor gap, a parsed constraint window) generalises. The reference patterns in Cyber & Phishing and Reacting to Events are written to generalise on purpose.
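As one illustration of detecting a condition instead of hard-coding a step number, the sketch below flags a forecast-vs-sensor gap that persists for several steps. Everything here is an assumption for illustration: GapDetector is a hypothetical class, the 25% threshold and two-step streak are untuned placeholder values, and you would feed it whichever observed and forecast values your controller compares.

```python
class GapDetector:
    """Flags when a reading strays from its forecast for several steps in a row."""

    def __init__(self, rel_threshold=0.25, min_consecutive=2):
        self.rel_threshold = rel_threshold      # assumed value, tune per scenario
        self.min_consecutive = min_consecutive  # assumed value, tune per scenario
        self.streak = 0

    def update(self, observed: float, forecast: float) -> bool:
        scale = max(abs(forecast), 1e-6)        # avoid division by zero
        gap = abs(observed - forecast) / scale  # relative gap
        self.streak = self.streak + 1 if gap > self.rel_threshold else 0
        return self.streak >= self.min_consecutive
```

Because it keeps a streak counter, the detector belongs on self in a Strategy class; call update once per step and act only when it returns True.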
replan dedupes. Check your LLM call count over a full run.
agent_plan.
Submit your controller through the in-app Submission Portal. No zipping, no CLI: paste your code into the editor and hit submit.
Your code can take exactly one of two shapes. The engine auto-detects which. Pick the one that fits your strategy; there's no advantage to picking the more complex one if you don't need it.
Use this if your controller is stateless: every step's action depends only on the current state (plus the forecast). No persistent variables between timesteps.
def controller(state):
    # Read state, return an action dictionary.
    soc = state["soc"]
    demand = state["demand"]
    solar = state["solar"]
    return {
        "battery_flow_mw": demand - solar,
        # Any key you omit defaults to 0.
    }
REQUIRED
The function must be named controller and take a single state argument.
Use this if you need persistent state between timesteps (e.g. a rolling error buffer, a PID memory, a precomputed plan from an LLM call). The engine instantiates your class once and reuses it for the whole run.
class Strategy:
    def __init__(self):
        # Anything you want to persist between steps lives on self.
        self.history = []

    def plan(self, initial_state):
        # OPTIONAL. Called ONCE before step 0. Right place for a slow
        # LLM call to read the scenario briefing. The dict you return
        # is stashed on state["agent_plan"] for every later step().
        return {"policy": "conserve"}

    def replan(self, state, alerts):
        # OPTIONAL. Called when qualitative alerts are active mid-run.
        # Right place for a second LLM call to react to text events.
        # Returned dict is merged into state["agent_plan"]. DEDUPE first!
        return {"policy": "respond"}

    def step(self, state):
        # REQUIRED. Called every 15-minute timestep.
        # DO NOT call an LLM here; it will time out.
        # Read state["agent_plan"] if you used plan()/replan().
        self.history.append(state["soc"])
        return {"battery_flow_mw": 10.0}
REQUIRED
A class with any name (e.g. MyStrategy). The portal detects it automatically from your code. What matters is the step method below, not the class name.
A step(self, state) method, written directly in this class, that returns an action dict. If the class has no step method, the engine refuses the submission. (A step only inherited from a base class or assigned by alias isn't detected. Keep it as a plain method here.)
A constructor the engine can call with no arguments (no required __init__ parameters).
plan(self, initial_state) and replan(self, state, alerts) are optional: the engine just skips them if you don't define them.
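To catch a shape mistake before spending a submission, you can approximate the check locally with Python's ast module. This is a sketch, not the portal's detector: detect_shape is a hypothetical helper, and checking for a top-level controller function before looking at classes is an assumed order. It looks for a step method written directly in a class body, which matches the rule above.

```python
import ast


def detect_shape(source: str):
    """Return "function", "class", or None for a controller source string."""
    tree = ast.parse(source)
    # Shape 1: a top-level function named controller.
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name == "controller":
            return "function"
    # Shape 2: a top-level class with step defined directly in its body.
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            has_own_step = any(
                isinstance(item, ast.FunctionDef) and item.name == "step"
                for item in node.body
            )
            if has_own_step:
                return "class"
    return None


aliased = "class B:\n    step = some_function\n"
print(detect_shape(aliased))  # prints: None
```

An aliased or inherited step returns None here, which is your cue to rewrite it as a plain method in the class you submit.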
The evaluation platform injects OPENAI_API_KEY as an environment variable inside your container. Read it with os.environ; it's already there.
# Works as-is on the evaluation platform.
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Use a fast model; the whole evaluation must finish within ~14 minutes.
resp = client.chat.completions.create(model="gpt-5.4-nano", messages=[...])
⚠ Use gpt-5.4-nano or gpt-5.4-mini
Your whole evaluation runs under a ~14-minute budget. Stick to gpt-5.4-nano (fastest, cheapest) or gpt-5.4-mini; they're quick enough to stay inside it. Larger or slower models risk exceeding the budget, and your run times out with no score (a timeout doesn't cost you a submission attempt, but you also get no result). These models draw on a shared credit pool, so keep calls few (call in plan/replan, dedupe by alert id).
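For scale, ~14 minutes over 288 steps averages roughly 2.9 seconds per step, and every LLM call eats into that. A simple guard can skip a call when the remaining time looks tight. This is a sketch under assumptions: BudgetGuard is a hypothetical helper, the 120-second safety margin is a made-up value, and it starts timing when you create it, which may be later than when the platform starts its own clock.

```python
import time


class BudgetGuard:
    """Tracks wall-clock time since creation against a total budget."""

    def __init__(self, budget_s=14 * 60, safety_margin_s=120, clock=time.monotonic):
        self._clock = clock                       # injectable for testing
        self._start = clock()
        self._limit = budget_s - safety_margin_s  # margin is an assumed value

    def elapsed(self) -> float:
        return self._clock() - self._start

    def can_afford(self, estimated_call_s: float) -> bool:
        """True if a call of the estimated duration still fits in the budget."""
        return self.elapsed() + estimated_call_s <= self._limit
```

Create one guard in __init__, then in replan check guard.can_afford(5.0) before calling the model and fall back to your default when it returns False.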
⚠ Remove your .env loading line before you submit
Locally you probably load your key from a .env file:
# LOCAL TESTING ONLY. DELETE BEFORE SUBMITTING.
from dotenv import load_dotenv
load_dotenv()
Delete those two lines (and the python-dotenv entry in your requirements.txt, if you added it) before pasting into the portal. The platform doesn't ship your .env; the env vars are already in the container. Code that tries to read a non-existent .env can silently no-op and leave the API key empty, which then throws an AuthenticationError on the first LLM call.
Never hardcode a key in your strategy.py; submissions are stored and re-runnable.
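If you want the same file to behave sensibly both locally and on the platform, read the key defensively instead of indexing os.environ directly. A minimal sketch, assuming you treat a missing or blank key as "no LLM available"; get_api_key is a hypothetical helper name.

```python
import os


def get_api_key():
    """Return the injected key, or None so callers can fall back gracefully."""
    key = os.environ.get("OPENAI_API_KEY", "").strip()
    return key or None


if get_api_key() is None:
    # No key available: skip LLM calls and use a default policy instead.
    print("OPENAI_API_KEY not set; running without the LLM")
```

With this in place, a missing key means your controller runs on its default policy. It no longer raises KeyError at import time.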
The portal has a requirements textarea right under the code editor. Paste any additional pip packages your strategy needs there (one per line, same format as a normal requirements.txt). The platform builds a fresh container with those packages installed before running your code. Common ones (numpy, scipy, pandas, the OpenAI SDK) are already in the base image, so you don't need to list them.
Each scenario allows 3 submissions. The Gauntlet allows 1. The portal shows your remaining count before you submit. Spend them wisely; playtest locally first. (A run that TIMEOUTs does not consume an attempt, but it also returns no score.)