Trade Observations

Stop Guessing: When Different Models Agree on the Same Stop Level

February 1, 2026
#trading-systems #risk-management #machine-learning #stops #MAE

Comparing Random Forest and Gradient Boosting stop models using MAE-to-date and recovery probability.


The Question

I’ve been working on replacing static stop rules with a data-driven, regime-aware stop framework.

The core question is simple:

At what point does a trade become statistically unlikely to recover?

Rather than guessing, I modeled probability of recovery as a function of MAE-to-date and market context, and then asked multiple models the same question.

What surprised me was not the answer itself, but how consistent it was.


The Setup

Data

  • Instrument: ES (1-minute bars)
  • Regime: PA-FIRST
  • Sample size: 639 trades
  • Features:
    • MAE-to-date (true adverse excursion using high/low)
    • ATR (1m)
    • Distance from EMA
    • EMA slope
    • Minutes in trade
    • Regime (one-hot encoded)

Label

A trade is considered recovered if final PnL > 0.

The models predict:

P(recover | current MAE-to-date, context)
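The labeling step can be sketched as follows. This is a minimal illustration, not the actual pipeline; column names like `trade_id` and `final_pnl` are assumptions about the schema.

```python
import pandas as pd

def label_snapshots(snapshots: pd.DataFrame, trades: pd.DataFrame) -> pd.DataFrame:
    """Attach the recovery label to every 1-minute snapshot.

    A trade counts as recovered if its final PnL ended positive;
    every snapshot of that trade inherits the same binary label.
    (Illustrative schema: `trade_id`, `final_pnl` are assumed names.)
    """
    final_pnl = trades.set_index("trade_id")["final_pnl"]
    labeled = snapshots.copy()
    labeled["recovered"] = (labeled["trade_id"].map(final_pnl) > 0).astype(int)
    return labeled
```

Because the label is defined per trade but predicted per snapshot, every snapshot of a losing trade is a negative example, which is exactly what lets the model learn how P(recover) decays as MAE-to-date grows.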


Models Compared

I trained and evaluated two nonlinear models:

1. Random Forest

  • Strong baseline
  • Handles nonlinearity and interactions
  • Often criticized for instability

2. Gradient Boosted Trees (Histogram-based)

  • Faster convergence
  • Strong bias control
  • Often outperforms RF on tabular data

Both models were trained identically:

  • Grouped by trade_id (no leakage)
  • Same features
  • Same probability threshold extraction logic

How the Stop Level Is Derived

Instead of using the model output directly, I apply a policy extraction step:

  1. Predict P(recover) at each 1-minute snapshot
  2. Bin snapshots by MAE-to-date
  3. Find the first MAE bin where:

mean P(recover) < 0.20

  4. Use that bin's MAE as the model-derived max stop level

This turns a probabilistic model into a deterministic, auditable risk rule.
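The extraction step above can be sketched in a few lines; the bin width here is an illustrative choice, not the production setting.

```python
import numpy as np
import pandas as pd

def extract_stop_level(mae_to_date, p_recover, bin_width=0.5, threshold=0.20):
    """Bin snapshots by MAE-to-date and return the right edge of the
    first bin whose mean predicted P(recover) falls below `threshold`.

    bin_width is an assumed parameter for illustration.
    """
    df = pd.DataFrame({"mae": mae_to_date, "p": p_recover})
    edges = np.arange(0.0, df["mae"].max() + bin_width, bin_width)
    df["bin"] = pd.cut(df["mae"], edges)
    mean_p = df.groupby("bin", observed=True)["p"].mean()
    for interval, p in mean_p.items():
        if p < threshold:
            return float(interval.right)
    return None  # no MAE level crossed the threshold in-sample
```

Running both models' predictions through the same `extract_stop_level` logic is what makes the comparison fair: any difference in the derived stop comes from the models, not the extraction.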


The Result

Both models independently produced the same stop level:

| Regime   | Threshold         | Max Stop (pts) | Observations |
|----------|-------------------|----------------|--------------|
| PA-FIRST | P(recover) < 0.20 | 9.9            | 639          |

This is remarkably close to the heuristic I had previously derived by hand:

  • Caution zone ≈ 9.5 pts
  • Hard failures accelerate ≈ 10–11 pts
  • Kill switch ≈ 12 pts

Why This Matters

When different model families agree, it usually means:

  • The signal is structural, not model-specific
  • MAE-to-date is the correct axis
  • The decision boundary is stable
  • The result is unlikely to be a coincidence

In other words:

This stop level is being discovered, not fit.


Design Implications

In live trading, this becomes:

  • Model exit: MAE-to-date ≈ 9.9 pts
  • Hard kill switch: 12.0 pts (safety backstop)
  • Execution floor: small buffer (e.g. 0.5 pts) to avoid noise
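Put together, the live-side rule is tiny and fully deterministic. A minimal sketch, with the function name and return convention being my own, not the actual engine's interface:

```python
def should_exit(mae_to_date: float,
                model_stop: float = 9.9,
                kill_switch: float = 12.0,
                buffer: float = 0.5) -> str:
    """Deterministic exit rule derived offline from the models.

    Returns "kill" at the hard backstop, "exit" once MAE-to-date reaches
    the model stop plus a small execution buffer, else "hold".
    """
    if mae_to_date >= kill_switch:
        return "kill"
    if mae_to_date >= model_stop + buffer:
        return "exit"
    return "hold"
```

Note the ordering: the kill switch is checked first, so even a misconfigured model stop can never widen the hard backstop.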

The model doesn’t replace discipline — it quantifies it.


A Subtle but Important Insight

This approach does not require loading models in production.

The models are used offline to learn regime-conditioned stop policies, which are then written to a database and consumed by the live execution engine.

That keeps live systems:

  • simpler
  • safer
  • easier to reason about
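Concretely, the hand-off can be as simple as one row per regime in a policy table. A sketch using SQLite, with the table name and columns being illustrative assumptions:

```python
import sqlite3

def publish_policy(db_path, regime, max_stop_pts, recover_threshold, n_obs):
    """Write a regime-conditioned stop policy row for the live engine.

    The models themselves never leave the research environment; only
    this small, auditable artifact does. (Schema is illustrative.)
    """
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS stop_policy (
        regime TEXT PRIMARY KEY,
        max_stop_pts REAL,
        recover_threshold REAL,
        n_observations INTEGER)""")
    con.execute("INSERT OR REPLACE INTO stop_policy VALUES (?, ?, ?, ?)",
                (regime, max_stop_pts, recover_threshold, n_obs))
    con.commit()
    con.close()
```

The live engine then needs only a primary-key lookup per regime, with no model runtime, no feature pipeline, and no version skew between research and production code.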

What’s Next

This was just PA-FIRST.

The real test (and likely divergence) comes with:

  • ATM-FIRST trades
  • higher volatility regimes
  • time-conditioned policies (early vs late trade)
  • asymmetric logic (tighten vs exit)

But the takeaway stands:

If Random Forests and Gradient Boosting agree on the same stop level, the market is telling you something worth listening to.


This post is part of an ongoing effort to replace intuition-driven trading rules with observable, testable system behavior.