Data Fights Back: Bitget Radar Fails (Part 2)
The lead–lag dream: when the idea sounds very “quant”
After staring at volatility and order books long enough, I fell into a very quant-flavored idea:
If the same pair trades on both Binance and Bitget, why not use both order books to predict volatility on Bitget?
The logic feels clean. In theory, prices on the two venues should move together. In practice, latency, order flow, fees, funding, and client mix can create tiny timing differences. If Binance tends to react first and Bitget follows a beat later, then Binance becomes an “early warning” signal for Bitget.
It even sounds slick in a team chat: “I’ll watch both books. When Binance starts shaking, I’ll use that to build a volatility radar for Bitget.”
The paper plan was simple: align both exchanges on a common time grid, resample everything to a shared 1-second timeline, engineer cross-exchange features (price gap, spread, depth differences), and let a lightweight model learn the mapping. No LSTMs, no Transformers, just tree-based models like Random Forest or XGBoost.
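On a whiteboard, the alignment step really is just a few lines of pandas. Here is a minimal sketch, assuming a hypothetical load_book_updates helper that returns raw top-of-book updates indexed by exchange timestamp, with made-up column names and CAKEUSDT as a stand-in pair:

```python
import pandas as pd

# Hypothetical loaders: raw top-of-book updates, indexed by exchange timestamp,
# with columns like bid_px, ask_px, bid_qty, ask_qty, last_px.
binance_raw = load_book_updates("binance", "CAKEUSDT")
bitget_raw = load_book_updates("bitget", "CAKEUSDT")

# Shared 1-second grid covering the overlap of both streams.
start = max(binance_raw.index.min(), bitget_raw.index.min()).ceil("1s")
end = min(binance_raw.index.max(), bitget_raw.index.max()).floor("1s")
grid = pd.date_range(start, end, freq="1s")

# Keep the last update inside each second, then forward-fill onto the grid.
binance_1s = binance_raw.resample("1s").last().reindex(grid).ffill()
bitget_1s = bitget_raw.resample("1s").last().reindex(grid).ffill()

# One aligned frame: both books, one row per second.
book = binance_1s.add_prefix("bn_").join(bitget_1s.add_prefix("bg_"))
```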
On paper, it was elegant. Reality disagreed.
Ambitious setup, tiny dataset
At that point, the total amount of orderbook history I had was only a few hours of real trading. Not days, not weeks – hours. The number of “big move” events inside that window was small.
To avoid cheating, I split the data by time into three blocks:
- First block: training.
- Middle block: validation.
- Last block: test.
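In code, the split was nothing fancier than slicing the aligned book frame chronologically; the 70/15/15 proportions below are illustrative, not the exact cut points I used:

```python
# Chronological split on the aligned 1-second frame: no shuffling,
# so the model never gets to peek into its own future.
n = len(book)
train = book.iloc[: int(n * 0.70)]               # first block
val = book.iloc[int(n * 0.70): int(n * 0.85)]    # middle block
test = book.iloc[int(n * 0.85):]                 # last block
```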
The task was: at each time t on Bitget, using the state of both Bitget and Binance orderbooks, predict whether there will be a sufficiently large move on Bitget in the next 10 seconds.
The label was defined much like in Part 1: look at the last traded price on Bitget from t+1s to t+10s, and compute the 10-second range, i.e. the gap between the highest and lowest last price in that window.
If this range is greater than or equal to some threshold (say 0.002), label that time t as 1 (“event”). Otherwise, label it 0 (“normal”).
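In pandas terms, the labelling might look like the sketch below. One assumption to flag: I am treating the range as relative to the current price, which is what makes a flat threshold like 0.002 (0.2%) meaningful across price levels, and bg_last_px is a hypothetical column name for Bitget's last traded price on the 1-second grid:

```python
HORIZON = 10        # look-ahead in seconds (one row per second on the grid)
THRESHOLD = 0.002   # assumed to be a relative range, i.e. 0.2%

last_px = book["bg_last_px"]

# Highest and lowest last price over (t+1s, t+10s]:
# rolling(10) at row j covers rows j-9..j, so shifting by -10 gives rows t+1..t+10.
fut_max = last_px.rolling(HORIZON).max().shift(-HORIZON)
fut_min = last_px.rolling(HORIZON).min().shift(-HORIZON)

future_range = (fut_max - fut_min) / last_px
book["label"] = (future_range >= THRESHOLD).astype(int)
# (the last HORIZON rows have no full future window and should be dropped in practice)
```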
Features were a mix of:
- prices, spread, depth on each side,
- imbalance between bid and ask,
- short-term returns,
- price gaps between Binance and Bitget,
- depth differences between the two books.
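A handful of these, sketched on the aligned frame (column names such as bn_bid_px or bg_ask_qty are placeholders for the merged Binance/Bitget columns):

```python
import numpy as np
import pandas as pd

feat = pd.DataFrame(index=book.index)

# Per-venue mid prices and Bitget's spread.
bg_mid = (book["bg_bid_px"] + book["bg_ask_px"]) / 2
bn_mid = (book["bn_bid_px"] + book["bn_ask_px"]) / 2
feat["bg_spread"] = book["bg_ask_px"] - book["bg_bid_px"]

# Top-of-book imbalance on Bitget: +1 = all size on the bid, -1 = all on the ask.
feat["bg_imbalance"] = (book["bg_bid_qty"] - book["bg_ask_qty"]) / (
    book["bg_bid_qty"] + book["bg_ask_qty"]
)

# Short-term returns on each venue.
feat["bg_ret_1s"] = np.log(bg_mid).diff()
feat["bn_ret_5s"] = np.log(bn_mid).diff(5)

# Cross-exchange features: relative price gap and top-of-book depth difference.
feat["px_gap"] = (bn_mid - bg_mid) / bg_mid
feat["depth_diff"] = (book["bn_bid_qty"] + book["bn_ask_qty"]) - (
    book["bg_bid_qty"] + book["bg_ask_qty"]
)
```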
Looking at the feature table, you’d think: “If there is any signal, the model should find something.”
So I started with some tree-based models, lightly regularized. On the training block, the metrics looked great: high ROC–AUC, PR–AUC clearly above baseline, precision and recall looking promising.
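For the record, the modelling step was the most ordinary part of the whole thing. A scikit-learn random forest stands in here for “tree-based models, lightly regularized”; the hyperparameters and the 70/15/15 cut are illustrative, not a recommendation:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, average_precision_score

feature_cols = feat.columns                    # features built above
data = feat.join(book["label"]).dropna()       # drop warm-up rows from diffs/logs

# Same chronological idea as before, applied to the joined feature/label frame.
n = len(data)
train_df = data.iloc[: int(n * 0.70)]
val_df = data.iloc[int(n * 0.70): int(n * 0.85)]

model = RandomForestClassifier(
    n_estimators=300,
    max_depth=6,            # shallow trees as light regularization
    min_samples_leaf=50,
    class_weight="balanced",
    random_state=42,
)
model.fit(train_df[feature_cols], train_df["label"])

for name, block in [("train", train_df), ("val", val_df)]:
    p = model.predict_proba(block[feature_cols])[:, 1]
    print(f"{name}: ROC-AUC={roc_auc_score(block['label'], p):.3f}  "
          f"PR-AUC={average_precision_score(block['label'], p):.3f}  "
          f"(event rate={block['label'].mean():.3f})")
```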
This is the most dangerous moment in any ML project: the screen is green, the plots look nice, and you start to believe you’ve found alpha.
Then you run the model on validation.
When the model learns “too well”: overfitting and the train–val cliff
On the validation block, everything sank. PR–AUC dropped sharply, precision worsened, recall became unstable depending on the threshold. On test, it was even worse: the model behaved barely better than a clever random guess.
We throw around the word overfitting a lot in ML, but seeing it on live-like data hits differently than in a toy tutorial. The model wasn’t “dumb”. It had learned some genuine patterns. But most of what it learned were details very specific to the first block of time – not stable patterns of how volatility behaves.
Looking more closely, a few problems popped up that I had conveniently ignored at the start.
First, the total span of data was simply too short. A few hours of orderbook data are not enough to learn robust behavior about volatility, especially when you’re trying to capture cross-exchange lead–lag.
Second, the label distribution was quite different across the blocks. One block had roughly 12% of its timestamps labelled 1; another was down at 3%. The base rate of events was changing across time blocks.
Third, the market regime itself was different across blocks:
- One block concentrated on a “crazy” period with high volume, news and strong moves.
- Another block was quiet, low volume, more sideways.
So when you train the model on block A and test on block B, you are effectively asking it to learn “Monday weather” and then forecast “Wednesday weather”, except Monday was a storm and Wednesday was a sunny, boring day. No matter how good the model is, if you haven’t seen enough variety in training, it will struggle.
Dirty data: websockets, duplicate timestamps, and frozen quotes
On top of the limited time span, the quality of the data was also silently sabotaging the model. Websockets in real life are not as polite as in the docs. In the raw streams, you will see:
- repeated timestamps back-to-back,
- missing updates for tens of seconds,
- bursts of updates arriving after a connection hiccup.
When you resample everything to 1-second bars, you’re forced to make choices:
Do you forward-fill the last known orderbook when there is no fresh tick in that second?
Or do you drop all such timestamps and accept gaps in your 1-second grid?
If you forward-fill everything, you are implicitly telling your model: “During this entire interval, the orderbook never changed; the market was asleep.” But in reality, it might have been a network issue, a subscription hiccup, or your client’s bug, while the real exchange continued trading.
If you drop every second without an update, you introduce holes in the timeline. Now some of your 10-second windows are built on incomplete snapshots – chunks of the market movie are simply missing.
Later, when I inspected Binance-only data more carefully, I found something uncomfortable: some 300–500 second flat stretches in last_price were real. Altcoins during dead hours simply don’t trade. The orderbook may wiggle a bit, but trades don’t happen and last_price sits still.
If you decide “every long flat area must be a bug”, you’ll end up throwing away real, important low-volatility regimes.
In the Bitget + Binance experiment, I hadn’t taken this seriously enough. I had forward-filled, but I didn’t track how far the resampled point was from the last real update. There was no “gap length” feature. In the model’s eyes, a timestamp with a fresh tick and a timestamp 20 seconds after the last real update looked identical — there was no way for it to tell fresh quotes from stale ones.
A simple idea that I didn’t use back then, but later became crucial, is defining, for every point on the resampled 1-second grid:

gap_len(t) = seconds elapsed since the last real orderbook update at or before t

and either:
- use it as a feature, so the model knows when the book is “stale”, or
- filter out points where gap_len is too large (treat them as blind zones).
Without this, the model was being fed a mix of real states and “stale copies” and asked to treat them equally.
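Computing it is cheap once you keep the raw update timestamps around. A sketch, reusing the hypothetical bitget_raw stream and the 1-second grid from the earlier snippets:

```python
import pandas as pd

# Timestamp of the most recent real Bitget update at or before each grid point.
update_times = pd.Series(bitget_raw.index, index=bitget_raw.index)
last_update = update_times.resample("1s").last().reindex(grid).ffill()

# gap_len in seconds: ~0 for fresh quotes, large for stale, forward-filled ones.
gap_len = (pd.Series(grid, index=grid) - last_update).dt.total_seconds()

# Option 1: let the model see the staleness.
feat["bg_gap_len"] = gap_len

# Option 2: treat very stale points as blind zones and drop them.
MAX_GAP = 5                          # seconds; an illustrative guardrail
feat_trusted = feat[gap_len <= MAX_GAP]
```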
Labels, gaps, and why “logically correct” can still be “practically bad”
When I first designed the label, the logic looked clean: look at Bitget price, compute the 10-second range, compare against a threshold, assign 0 or 1. It matches how a trader thinks about being stopped out by noise.
But two deeper issues were hiding here.
First, some of the 10-second future windows were built over low-quality or stale data. If the orderbook wasn’t properly updated or the last_price was frozen due to a client-side issue, the computed range could be misleading. In other words, I was creating labels over windows that I wouldn’t trust myself if I actually looked at the raw quote stream.
Second, in a cross-exchange problem, the label was entirely based on Bitget, while the features mixed Bitget and Binance. Sometimes Binance led, sometimes lagged, sometimes decoupled. Without enough hours and regimes, the model had an easy way out: ignore the cross-exchange complexity and just memorize Bitget’s own local quirks in that tiny training window.
So in theory the label definition was logical. In practice, it was feeding the model a noisy, inconsistent truth.
Later, when I switched to Binance-only and redesigned the pipeline, I was much stricter:
- forward-fill, and compute gap_len;
- filter out timestamps where either the current point or any point in the next 10 seconds had gap_len exceeding some guardrail (e.g., 5 seconds);
- only then compute the 10-second range and assign labels.
That way, label generation happens only on windows where you can reasonably trust that the data is “live” and continuous.
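Stitched together, that stricter pass looks roughly like the sketch below, reusing last_px, gap_len and the 10-second labelling from the earlier snippets; the 5-second guardrail matches the text, everything else is illustrative:

```python
HORIZON = 10   # label look-ahead, seconds
MAX_GAP = 5    # maximum tolerated staleness, seconds

# A timestamp is usable only if it is fresh itself *and* every point
# in its 10-second future window is fresh too.
fresh = (gap_len <= MAX_GAP).astype(int)
future_all_fresh = fresh.rolling(HORIZON).min().shift(-HORIZON) == 1
usable = (fresh == 1) & future_all_fresh

# Compute the 10-second forward range, but only trust it on usable windows.
fut_max = last_px.rolling(HORIZON).max().shift(-HORIZON)
fut_min = last_px.rolling(HORIZON).min().shift(-HORIZON)
raw_label = ((fut_max - fut_min) / last_px >= 0.002).astype(int)

label = raw_label.where(usable)   # 0/1 on trusted windows, NaN in blind zones
```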
Looking back at the Bitget + Binance setup, if I had to redo it, I’d first:
- fully separate “trusted windows” from “blind zones” based on gap_len,
- maybe focus on a single exchange first until the pipeline is solid,
- and only then revisit cross-exchange features when I actually have enough continuous, clean hours from both sides.
Label imbalance and the subtle trap of time-block splits
Another softer but very real issue was how I split train/val/test.
Splitting by time blocks is conceptually correct for time series. It mirrors the real deployment scenario: you train on past data, then predict on unseen future periods.
But when your data span is very short, slicing it into three blocks can accidentally produce:
- a training block sitting on a “hot” regime with more events,
- a validation or test block sitting on a “cold” regime with fewer events.
Concretely, one block might have a label-1 proportion around 12%, another block around 3%. Different base rates mean even a good model will show different PR–AUC across blocks. If you don’t pay attention, you can misinterpret this as “the model suddenly became terrible”, when part of it is simply the event rate changing.
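A quick way to see how much of a PR-AUC move is just the base rate: the PR-AUC of a no-skill classifier sits at roughly the positive rate, so the floor itself shifts between a 12% block and a 3% block before the model does anything. A toy illustration with made-up block sizes:

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)

for name, base_rate in [("hot block", 0.12), ("cold block", 0.03)]:
    y = (rng.random(20_000) < base_rate).astype(int)  # labels at the block's base rate
    scores = rng.random(20_000)                       # a model with zero skill
    print(f"{name}: event rate={y.mean():.3f}, "
          f"no-skill PR-AUC={average_precision_score(y, scores):.3f}")
# Prints roughly 0.12 vs 0.03: the baseline moves with the event rate.
```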
In the Bitget + Binance project, I saw a large train–val performance gap and initially blamed weak models or poor hyperparameters. The real cause was simpler:
- too few hours overall,
- too aggressive splitting into 3 blocks,
- and regimes that changed between those blocks.
With so little data, the model never had a chance to see enough diversity in training to generalize across regimes.
One lesson I pulled from this:
When you are still in the proof-of-concept phase with limited data, it can be better to narrow the scope instead of widening it. Work with one exchange, one pair, and a couple of days that are representative, build a stable pipeline there, and only then expand to cross-exchange or multi-regime setups once you have more history.
Conclusion
If I summarize the Bitget + Binance experiment in one sentence, it would be:
I tried to solve an ambitious cross-exchange volatility problem with data that was too short, too dirty, and too unevenly distributed.
I tried to catch lead–lag between two exchanges using:
- only a few hours of history,
- websocket-driven data with gaps, duplicates and stale stretches,
- labels defined on top of windows that sometimes weren’t trustworthy,
- time-block splits that landed in very different regimes, with different event rates.
The models weren’t inherently weak. They did their job: they extracted patterns from the training block. The problem was that the training block simply didn’t represent the reality of the validation and test blocks.
Everything still looked “scientific”: we had a pipeline, features, an ML model, PR–AUC, train–val–test split. But seen in context, it becomes obvious:
Machine learning doesn’t rescue a badly posed problem. It magnifies your mistakes and hides them under nice-looking metrics.
In Part 3, we tighten the sandbox and rerun the whole idea under much stricter rules. Everything stays on one exchange, Binance, and we focus on a single pair such as CAKEUSDT. The 10-second event definition is kept explicit and consistent, gaps and stale quotes are handled deliberately instead of being silently forward-filled, and features are built only on clean, resampled 1-second data. Most importantly, evaluation doesn’t stop at train/validation/test inside the same two days. The model is also pushed onto a separate out-of-sample day, where it has to survive conditions it has not “seen” before.
That’s where the simple question from Part 1, “Will there be drama in the next 10 seconds?”, finally meets a setup where machine learning can give an answer that isn’t just pretty in a notebook, but robust enough to be taken seriously in a trading context.