The Kayfabe Problem

What happens when you train a machine-learning model on 482,166 scripted fights

A few weeks ago I shipped a thing I'd been threatening to ship for a year: an end-to-end machine-learning system trained on four decades of professional wrestling. The dataset is up on Kaggle and mirrored on Hugging Face; the trained model's card and source code are on GitHub. 482,166 matches. 12,814 wrestlers. Six promotions: WWE, AEW, WCW, ECW, NXT, TNA. Coverage from 1980 to now.

Here's the thing almost nobody flags when you tell them you trained a model on pro wrestling: the matches are scripted. Outcomes (who wins, who loses, who turns heel in the third act) are decided in advance by booking writers and executed by performers. The label your model is trying to predict isn't an athletic measurement. It isn't a behavioral signal. It's creative output. A room of writers in Stamford, Connecticut decided it.

That's the kayfabe problem. It is the most interesting thing about this dataset, and it ate most of my year.

What kayfabe means in data terms

Kayfabe is wrestling jargon for the convention of presenting scripted events as real. Translated into data-modeling language, it has a precise meaning: your prediction target is generated by a writers' room rather than by athletic competition. That single fact reshapes everything downstream. Three things follow immediately.

One: a model can never learn athletic skill from this data, because nothing in the dataset measures it. Two equally booked wrestlers will produce identical feature vectors regardless of who is the better worker.

Two: the model can absolutely learn booking patterns: who gets pushed, who's a jobber, which feuds end at a PPV, which wrestlers are being escalated toward a planned payoff.

Three, the one that bites you in evaluation: the label sequence is autocorrelated. A wrestler on a five-match win streak is being booked toward something. Their next-match outcome is not independent of the previous five. Storylines persist. The data has memory.

If outcomes were random athletic results, every active wrestler's career win rate would cluster near 0.5 with binomial variance. The empirical distribution looks nothing like that. It's bimodal: a heavy left lobe (jobbers booked to lose), a heavy right lobe (stars booked to win), and a thin middle. Eighteen percent of qualifying wrestlers have a career win rate above 0.7. Fourteen percent are below 0.3. A coin flip would predict five percent in each tail. The shape of that histogram is the entire problem in a single picture. I call it the kayfabe signature.

The 23-point gap

I trained two models on 35 features: recent form, head-to-head record, match context, title proximity, alignment, momentum, the usual suspects. A logistic-regression baseline and an XGBoost primary. XGBoost lands at 0.718 AUC on a true future hold-out; logistic regression at 0.698. Both meaningfully clear the coin-flip floor, and both beat a naive favored-wrestler-always-wins baseline that scores around 0.62. That's the honest number.

But here's the part that took me weeks to accept: on the validation set, that same XGBoost model scores 0.952. Twenty-three AUC points higher. That isn't a hyperparameter-tuning problem. That isn't a leakage bug I can fix. That's the kayfabe problem propagating through the temporal split. Storylines persist across calendar boundaries. Matches in December 2024 are deeply informative about matches in November 2024: same wrestlers, same feuds, same alignment state. They're only weakly informative about matches in June 2025, after storylines have turned over and writers have moved on.
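To make the mechanics concrete, here's a minimal sketch of that evaluation setup. It assumes a pandas DataFrame of matches with a date column, a binary won label, and the engineered features already computed; every column name and split date below is illustrative, not the released dataset's actual schema.

```python
# Minimal sketch of the temporal evaluation, assuming a pandas DataFrame of
# matches sorted by date, with a binary `won` label and engineered feature
# columns. Column names and split dates are illustrative, not the repo's schema.
import pandas as pd
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

matches = pd.read_csv("matches.csv", parse_dates=["date"]).sort_values("date")
features = [c for c in matches.columns if c not in ("date", "won")]

# Adjacent-in-time validation: the same feuds and alignment states sit on
# both sides of this boundary, so the split leaks storyline memory.
train = matches[matches["date"] < "2024-12-01"]
val = matches[(matches["date"] >= "2024-12-01") & (matches["date"] < "2025-01-01")]
# True future hold-out: by mid-2025 most storylines have turned over.
test = matches[matches["date"] >= "2025-06-01"]

model = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.05)
model.fit(train[features], train["won"])

for name, split in [("val (adjacent in time)", val), ("test (true future)", test)]:
    auc = roc_auc_score(split["won"], model.predict_proba(split[features])[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```

The exact dates don't matter. What matters is that the validation boundary sits inside live storylines while the test boundary sits past their turnover, and the difference between those two printed AUCs is the gap.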
My first instinct was that the gap was a bug. After enough ablation studies I realized it's a feature: a structural property of any prediction problem where the label generator has memory, and the memory decays at a rate I can't see. Booking arcs are autocorrelated. Validation that doesn't account for that overstates real-world performance, and there's no version of this problem where it doesn't. Reporting the gap honestly turned out to be more valuable than reporting a closer val/test agreement that I could only have produced by introducing leakage.

Why this isn't really about wrestling

The chaos here is that the outcomes look like real fights and aren't. The order is that writers follow templates, and the templates are visible if you have enough matches. Streak features alone carry over half the model's signal: remove the five momentum features and test AUC collapses from 0.718 to 0.541, barely above coin-flip. The model isn't extracting many small signals; it's extracting one large signal, booking momentum, that the other thirty features modestly refine.

That's a useful pattern beyond pro wrestling. Any system where humans decide outcomes through narrative templates (A/B test winners chosen by a head of growth, hiring panels, content moderation, anything where someone is escalating toward a planned payoff) has the same shape. Streak features dominate. Random splits leak. Validation looks great until you let real time pass.

The work that's worth doing in this space probably isn't a better wrestling-outcome classifier. It's reframing the target. The Cagematch rating column captures crowd response, a real human signal rather than a writer's whim, and regressing on rating instead of classifying on outcome gives you a less-corrupted target from the same data. That's where the next release goes.

Everything's public: CC0 on the dataset, Apache 2.0 on the model and the code. If you want to poke at it, the starter notebook on Kaggle is the fastest way in. If you build something on top, send it to me; I'm @currentlyted here and on the rest of the internet.

Next issue: how to build a feature-importance audit that survives the kind of label autocorrelation we just walked through. It's the second-most-useful thing I learned from this project, and it generalizes a lot further than the first.

— Ted
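P.S. If you want to reproduce the momentum ablation before the next issue lands, it's a short extension of the split sketch above. The five momentum column names here are placeholders for whatever your feature table actually calls them, not the repo's real feature names.

```python
# Hypothetical ablation harness, continuing the split sketch from earlier in
# this issue (`train`, `test`, and `features` are defined there). The five
# momentum column names below are placeholders, not the repo's actual names.
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

momentum = ["win_streak", "wins_last_5", "wins_last_10",
            "days_since_last_win", "momentum_score"]
ablated = [f for f in features if f not in momentum]

for name, cols in [("all 35 features", features), ("momentum removed", ablated)]:
    m = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.05)
    m.fit(train[cols], train["won"])
    auc = roc_auc_score(test["won"], m.predict_proba(test[cols])[:, 1])
    print(f"{name}: test AUC = {auc:.3f}")
# If booking momentum really carries the signal, the second AUC should
# collapse toward the coin-flip floor.
```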