Yes, that staleness gap is exactly where we got burned early on. The planner chooses something like “Click on element 47” and by the time it runs, the page has re-rendered and 47 is now a completely different button.
What we do in OpenClick is basically two layers.
Within a batch: Each AX action (click, type, etc.) re-resolves its target just before execution using a fresh AX snapshot, not the one the planner saw. We never rely on element indices. Everything is matched via __ax_id, title, or more stable signals like role + frame.
If an action is likely to change the UI state, we force an AX refresh before the next step, as this is where things usually go astray. Pixel-based targeting is just a fallback for things like Canvas or WebGL where AX is useless.
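Roughly, the re-resolution step can be sketched like this. It is a simplified illustration, not the actual OpenClick code: `AXNode`, `Target`, `resolve`, the matching order (id, then title, then role + frame) and the frame tolerance are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Frame = Tuple[int, int, int, int]  # (x, y, width, height); hypothetical layout


@dataclass(frozen=True)
class AXNode:
    ax_id: Optional[str]
    role: str
    title: str
    frame: Frame


@dataclass(frozen=True)
class Target:
    """What the planner asked for, described by signals rather than an index."""
    ax_id: Optional[str] = None
    role: Optional[str] = None
    title: Optional[str] = None
    frame: Optional[Frame] = None


def resolve(target: Target, snapshot: List[AXNode], tolerance: int = 8) -> Optional[AXNode]:
    """Find the target in a FRESH snapshot; return None if it is gone or ambiguous."""
    # 1. Stable id, if the app exposes one.
    if target.ax_id is not None:
        for node in snapshot:
            if node.ax_id == target.ax_id:
                return node

    # 2. Title (plus role when known), accepted only if unique.
    if target.title:
        hits = [n for n in snapshot
                if n.title == target.title
                and (target.role is None or n.role == target.role)]
        if len(hits) == 1:
            return hits[0]

    # 3. Role + approximate frame, accepted only if unique.
    if target.role and target.frame:
        hits = [n for n in snapshot
                if n.role == target.role
                and all(abs(a - b) <= tolerance for a, b in zip(n.frame, target.frame))]
        if len(hits) == 1:
            return hits[0]

    return None  # caller should replan instead of guessing
```

The point is that the position of a node in the snapshot list never matters, so a re-render that reorders or inserts elements does not silently retarget the click. Returning None on an ambiguous match is a design choice in this sketch: a replan is cheaper than clicking the wrong button.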
Between batches: We take a fresh screenshot and AX snapshot, then run a validator model that checks whether what we expected actually happened. If it didn't, or only partially did, we replan from the new state, along with a brief summary of what happened.
So we don’t really trust plans for more than one batch, and for this reason we keep batches small (usually 3-5 actions).
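The outer loop is roughly the following. Again, this is only a sketch under assumptions: `run_batches` and the injected callables (`plan_batch`, `execute`, `observe`, `validate`) are hypothetical names, the verdict strings are invented, and the real validator is a model rather than a plain function.

```python
from typing import Any, Callable, List, Tuple

Action = Any
State = Any
Review = List[Tuple[List[Action], str]]


def run_batches(
    plan_batch: Callable[[State, Review], List[Action]],
    execute: Callable[[Action], None],
    observe: Callable[[], State],
    validate: Callable[[List[Action], State], str],
    max_batches: int = 10,
    batch_size: int = 4,
) -> Review:
    """Plan, execute, and validate one small batch at a time."""
    review: Review = []
    state = observe()
    for _ in range(max_batches):
        # Never trust a plan beyond one batch: plan from the latest state
        # and cap the number of actions taken from it.
        batch = plan_batch(state, review)[:batch_size]
        if not batch:
            break  # planner has nothing left to do
        for action in batch:
            execute(action)
        # Fresh observation (screenshot + AX snapshot in the real system).
        state = observe()
        verdict = validate(batch, state)  # e.g. "ok", "partial", "failed"
        # The running review is what the planner sees on the next round.
        review.append((batch, verdict))
    return review
```

Keeping `batch_size` small bounds how far execution can drift from what the planner assumed before the validator gets a chance to catch it.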
Honestly, the toughest cases now are not AX drift, but apps that expose AX inconsistently or lazily. Gmail is a classic. Message rows can be weird, so we sometimes force an AX refresh right before clicking on them. Otherwise you’ll get cases where a programmatic click “works” but the row is never actually activated.
Curious to know what approach you ended up taking here.