From GARCH to Orderbook ML: Building a 10-Second Volatility Radar (Final Part)
Resetting the game: one exchange, one pair, one clear question
After the messy Bitget + Binance attempt in Part 2, let’s stop being a hero and act like a responsible quant for once. That means shrinking the sandbox until both the data and the question become crystal clear.
So for this part, the setup is:
- One exchange: Binance
- One pair: CAKEUSDT
- One dataset: a 1-second orderbook file with columns like ts_iso, best_bid, best_ask, last_price, best_bid_qty, best_ask_qty, bid_notional_5, ask_notional_5
- Main training window: two days, 25–26 November
- One untouched out-of-sample day: 27 November
However, the trading question stays the same:
“At time t, will the price move enough in the next 10 seconds to matter?”
In other words: Is the 10-second range big enough to hit a tight stop, blow out a narrow spread, or generally troll a short-term trader?
Everything else in this part—GARCH, ML, orderbook features—is just a different way to answer that one question.
Data cleaning: 1-second resample, stale quotes, and trusting your labels
Orderbook data doesn’t arrive nicely once per second like a polite API. Sometimes you get multiple updates in one second, sometimes none for 20 seconds, sometimes your websocket just takes a nap.
To make any model meaningful, you first need a consistent time grid and a sense of how fresh each quote is. You can try this pipeline:
- Resample the raw stream to 1-second bars and forward-fill prices and quantities. That gives you a neat time series where every second has last_price, best_bid, best_ask, depth, etc.
- For each second, mark whether there was a real tick in that second (has_tick = 1) or whether you just carried forward the previous quote (has_tick = 0).
- Compute a “staleness” measure called gap_len = number of seconds since the last real tick.
- If gap_len(t) = 0, the quote at time t is fresh.
- If gap_len(t) = 8, you haven’t seen a new update for 8 seconds, even though you are still filling values forward.
- Only keep timestamps where both:
- the current gap_len is small enough (for example, ≤ 5 seconds), and
- the maximum gap_len in the next 10 seconds is also ≤ 5 seconds.
That second condition matters. If you want to build a label about what happens from t+1 to t+10, you should refuse to do it on a future window where half of the quotes might be stale.
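Here’s a minimal pandas sketch of that resample-and-filter step. It assumes a tick-level DataFrame `raw` indexed by timestamp with the columns listed earlier; the project’s actual code may differ in details:

```python
import pandas as pd

# raw: tick-level DataFrame indexed by timestamp, with columns like
# last_price, best_bid, best_ask, best_bid_qty, best_ask_qty, ...
bars = raw.resample("1s").last()

# has_tick = 1 if at least one real update landed in this second
bars["has_tick"] = raw["last_price"].resample("1s").count().gt(0).astype(int)

# carry the previous quote forward across empty seconds
bars = bars.ffill()

# gap_len = seconds since the last real tick (0 = fresh quote)
bars["gap_len"] = bars.groupby(bars["has_tick"].cumsum()).cumcount()

# keep t only if the quote is fresh now AND stays fresh over t+1..t+10;
# rolling(10).max() evaluated at t+10 covers t+1..t+10 once shifted back by 10
MAX_GAP = 5
future_max_gap = bars["gap_len"].rolling(10).max().shift(-10)
valid = (bars["gap_len"] <= MAX_GAP) & (future_max_gap <= MAX_GAP)
```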
The label itself is defined in a trader-friendly way. For each time t, you look at last_price from t+1 to t+10 and define the forward range:

range_10s(t) = max(last_price, t+1..t+10) − min(last_price, t+1..t+10)

If this 10-second range is at least, say, 0.002 for CAKEUSDT, you label that timestamp as:
- 1 (event): “there was a meaningful 10-second move after t”
- 0 (non-event): “nothing dramatic happened”
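Continuing the sketch above, the forward range and label are computed on the full 1-second grid first, then masked with `valid` so that no label leans on stale quotes:

```python
THRESHOLD = 0.002  # "meaningful move" size for CAKEUSDT, from the text

px = bars["last_price"]
# same shift trick as before: rolling window ending at t+10 covers t+1..t+10
bars["range_10s"] = px.rolling(10).max().shift(-10) - px.rolling(10).min().shift(-10)
bars["label"] = (bars["range_10s"] >= THRESHOLD).astype(int)

# trust labels only where the present quote and the whole future window are fresh
dataset = bars[valid].dropna(subset=["range_10s"])
```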
After filtering out all the stale parts using gap_len and then assigning labels, the 25–26 November window ends up with about 172k one-second samples, with roughly 4% labeled as events. That’s rare enough to be difficult, but not so rare that everything becomes hopeless.
GARCH: a quick crash course and a fair baseline
Before letting machine learning loose, I wanted to give a classic volatility model a fair shot: GARCH.
You can think of GARCH (Generalized Autoregressive Conditional Heteroskedasticity) as a way to model how volatility clusters in time. It doesn’t try to predict direction; it tries to answer: “how wild is the next move likely to be, given recent behavior?”
First, you take 1-second log-returns of the price:

r(t) = ln(P(t) / P(t−1))

where P(t) is last_price at second t. A GARCH(1,1) model then lets the conditional variance evolve as

σ²(t) = ω + α·r(t−1)² + β·σ²(t−1)

so that large recent moves inflate the expected size of the next one, and calm stretches let it decay.

So in this project, GARCH becomes:
- input: the return series,
- output: a volatility score,
- decision rule: “if volatility score is high, I expect range_10s(t) to cross my threshold; if low, I expect calm.”
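As a concrete sketch, here’s how such a baseline could look with the `arch` package (a common Python choice; the article doesn’t name its exact library, and the return scaling and quantile cutoff below are illustrative):

```python
import numpy as np
from arch import arch_model

# 1-second log-returns on the contiguous grid; arch's optimizer behaves
# better when returns are scaled away from ~1e-5
returns = np.log(bars["last_price"]).diff().dropna() * 1000

# zero-mean GARCH(1,1): sigma²(t) = omega + alpha·r(t-1)² + beta·sigma²(t-1)
am = arch_model(returns, mean="Zero", vol="GARCH", p=1, q=1)
res = am.fit(disp="off")

# the per-second conditional volatility is the "volatility score"
vol_score = res.conditional_volatility

# illustrative decision rule: flag the noisiest ~4% of seconds as "event expected"
flag = vol_score > vol_score.quantile(0.96)
```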
On the 25–26 November window, you get roughly:
- base event rate ≈ 4%,
- ROC–AUC ≈ 0.80,
- PR–AUC ≈ 0.16.
That may look small, but with a 4% base rate, 0.16 PR–AUC is solid. GARCH is clearly doing better than random, and it highlights noisy zones that genuinely tend to contain more 10-second moves.
But GARCH is blind to the orderbook. It only sees the price series. Depth, spread, imbalance—everything microstructural that a market maker stares at all day—is invisible to it.
That’s exactly where ML has a chance to shine.
How ML models actually make a decision
People love to say “we used ML” as if it’s some mystical cloud. In reality, most of the models that worked well here are very concrete:
- Logistic Regression: a linear model that outputs probabilities using a sigmoid.
- Tree ensembles (Random Forest, XGBoost, LightGBM): many decision trees averaged together.
Let’s unpack how they turn features into “yes/no” signals.
Logistic Regression: weighted sum → probability → threshold
For each timestamp t, you build a feature vector X(t): things like last_price, spread, imbalance, short-term returns, time-of-day, etc.

Logistic Regression computes a weighted sum:

z(t) = w · X(t) + b

then passes this through a sigmoid:

p(t) = 1 / (1 + e^(−z(t)))

and fires a signal whenever p(t) ≥ τ, where τ is a threshold you choose depending on how scared you are of false alarms vs missed events.
Because it’s linear in the features, Logistic Regression is quite interpretable: you can inspect w to see which features push the probability up or down.
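A minimal scikit-learn version of this could look as follows; the feature names and the 80/20 time-based split are assumptions for illustration, not the project’s exact setup:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = ["spread", "imbalance", "ret_1s", "ret_5s", "second_of_day"]  # illustrative

# time-based split inside 25–26 November: train on the past, test on the future
split = int(len(dataset) * 0.8)  # 80/20 is an assumption, not the article's split
train, test = dataset.iloc[:split], dataset.iloc[split:]
X_train, y_train = train[FEATURES], train["label"]
X_test, y_test = test[FEATURES], test["label"]

# scaling matters for a linear model; class_weight compensates for ~4% positives
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
clf.fit(X_train, y_train)

p_test = clf.predict_proba(X_test)[:, 1]  # p(t) from the sigmoid above
signal = p_test >= 0.5                    # τ = 0.5; tune to your alarm tolerance
```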
In this project, it already beats GARCH strongly: ROC–AUC around 0.94, PR–AUC around 0.5+, just by combining microstructure and short-term returns in a simple way.
Tree ensembles: slicing feature space into regions
Tree-based models go one step further. Instead of fitting a straight line in feature space, they recursively split the space into regions where the event rate is relatively homogeneous.
A single decision tree works like this:
- At the root, you choose a feature and a threshold, e.g. “spread ≤ 0.0012?”.
- If yes, go left; if no, go right.
- At the next node, split again, maybe on depth imbalance or recent return.
- Eventually you reach a leaf node: all training points that landed there have some empirical event frequency, say 12%.
That leaf’s event frequency becomes the predicted probability for any new point that follows the same path.
Tree ensembles—Random Forest, XGBoost, LightGBM—repeat this many times with slightly different trees:
- Random Forest: many trees trained on bootstrapped samples, decisions are averaged.
- XGBoost / LightGBM: build trees sequentially; each new tree focuses on correcting the mistakes of the previous ones (gradient boosting).
In the end, for each timestamp t, you get a blended probability: for a forest of K trees, roughly p(t) = (1/K) · Σ tree_k(X(t)), while boosted models sum weighted tree outputs and squash the total into [0, 1].
The difference compared to Logistic Regression is that trees can learn nonlinear interactions. For example:
- “If spread is tiny and both sides of the book are very thick and recent returns are flat, vol is unlikely.”
- “If spread suddenly widens and best ask volume just vanished and there was a small up-move, a breakout is more likely.”
You don’t have to hard-code those rules; the model infers them by trying many candidate splits and keeping the ones that most reduce its training loss.
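With LightGBM’s scikit-learn interface, the whole ensemble fits in a few lines, reusing the split above (hyperparameters here are generic starting points, not tuned values from the article):

```python
from lightgbm import LGBMClassifier

model = LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=63,
    class_weight="balanced",  # again, ~4% positives
)
model.fit(X_train, y_train)

# blended probability per timestamp, as described above
p_tree = model.predict_proba(X_test)[:, 1]
```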
Orderbook features vs GARCH: who wins on 25–26 November?
With this machinery in place, here’s what happens on the test portion of the 25–26 November window (still within those two days, but unseen during training):
- GARCH, using only price, reaches PR–AUC around 0.16.
- Logistic Regression, using orderbook features + short-term returns, jumps to PR–AUC around 0.5+.
- Tree-based models (Random Forest, XGBoost, LightGBM) push PR–AUC up to roughly 0.60 or higher.
ROC–AUC for the trees stays in the 0.93–0.95 range, which is strong but less informative than PR–AUC given the low base rate.
To make this more concrete, take LightGBM as an example and use a threshold around 0.5 on the test split of 25–26 November:
- Precision ≈ 0.54
- Recall ≈ 0.70
- F1 ≈ 0.61
So out of all the times the model shouts “there will be a 10-second move”, it’s right a bit more than half of the time, and it catches roughly 70% of all real events.
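For reference, metrics like these are one scikit-learn call each:

```python
from sklearn.metrics import (
    average_precision_score,           # PR–AUC
    precision_recall_fscore_support,
    roc_auc_score,
)

roc_auc = roc_auc_score(y_test, p_tree)
pr_auc = average_precision_score(y_test, p_tree)

# precision / recall / F1 at the chosen threshold
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, p_tree >= 0.5, average="binary"
)
```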
Compared to GARCH, you’re no longer just getting a generic “volatility is elevated” warning. You’re getting a targeted radar that uses:
- the shape of the orderbook,
- short-term micro-moves,
- and time-of-day effects
to pinpoint where turbulence is most likely to show up.
The real exam: a fresh day with different behavior (27 November)
Nice metrics inside the same two days are comforting but not enough. The real question is:
If you train everything on 25–26 November and freeze the model, can it still deliver on 27 November, with its own distribution of calm and storm?
On 27 November, after applying the same gap rules:
- you end up with about 86k seconds of data,
- the event rate drops to about 2.7% (fewer big 10-second moves).
Even so, the LightGBM model trained on 25–26 November holds up very well:
- ROC–AUC ≈ 0.94
- PR–AUC ≈ 0.62, despite the lower base rate
And if you keep the same threshold of 0.5 you used on 25–26 November:
- Precision ≈ 0.59
- Recall ≈ 0.59
- F1 ≈ 0.59
So on a completely new day with fewer events, the model still manages to:
- flag a relatively small subset of timestamps as “volatile soon”,
- and pack a large fraction of real events into that subset.
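Mechanically, the out-of-sample check is deliberately boring: rebuild the same features and labels on the 27 November file (here a hypothetical `day27` frame produced by the same resample/gap/label pipeline), then score it with the frozen model and the frozen threshold:

```python
X_oos, y_oos = day27[FEATURES], day27["label"]

p_oos = model.predict_proba(X_oos)[:, 1]  # frozen model, no refit, no re-tuning
oos_hits = p_oos >= 0.5                   # the same τ = 0.5 as on 25–26 November
precision, recall, f1, _ = precision_recall_fscore_support(
    y_oos, oos_hits, average="binary"
)
```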
At this point, the model isn’t just fitting a particular day. It’s learned something more stable about how CAKEUSDT’s microstructure behaves before short-term volatility spikes.
GARCH, in comparison, sees some of these zones but misses many others—especially those where the orderbook starts flashing warning signs before the price itself has exploded.
What this really means for traders (and for future ML work)
After three parts, a few lessons feel robust enough to survive outside this notebook:
First, ambition must match data.
The Bitget + Binance experiment was conceptually exciting, but the data was too short and too messy. With only a few hours of history, you’re better off doing one exchange well than two exchanges badly.
Second, GARCH is a respectable baseline, not a relic.
It doesn’t win this specific 10-second volatility prediction contest, but it sets a meaningful floor. Any ML setup that can’t beat GARCH on a volatility problem should probably be fixed or thrown out.
Third, orderbook microstructure is a goldmine for short-horizon vol.
Features like depth imbalance, spread behavior, recent tiny returns, and “how long the book has been frozen” clearly add predictive power that pure price-based models just don’t have.
Fourth, labels and data hygiene matter as much as architecture.
A clean, trader-relevant label (“10-second range above my stop size or not”), strict handling of stale quotes, and sensible time-based splits did more for this project than any exotic model choice would have.
Finally, you don’t always need deep learning to get real edge.
Logistic Regression plus tree ensembles (XGBoost, LightGBM) were already enough to turn a 10-second volatility question into a usable radar, both in-sample and out-of-sample. Before reaching for LSTMs or Transformers, it’s usually smarter to double-check your data, features, and labels.
Is this the final word on short-term volatility prediction? Of course not. With more days, more pairs, and regime-aware models (news days vs normal days, high vs low liquidity sessions), you can push this much further.
But even at this stage, the experiment supports a very practical conclusion:
If you respect your data, design labels that match your trading pain, and let ML look at the right orderbook features, then yes: you can build a 10-second volatility radar that clearly outperforms GARCH and stays alive on a new day.
And for a trader, having a machine quietly whisper “the next 10 seconds are likely to be spicy” is sometimes all you need to place a better order—or to decide not to place one at all.