Anyone with an API key can wrap an LLM around a stats feed and call it an AI picks product. Generating predictions isn't the hard problem in 2026. The hard problem, the one we spend most of our engineering time on, is deciding which picks to surface and which to leave alone.
Last month, Moddy's models made 104,092 picks. We're not going to tell you all of them were good. Some were great. A lot were average. A meaningful number weren't worth a wager. That gap between the picks worth betting and the picks worth ignoring is where bettors actually need help. This is the story of how we close it.
The actual job: curation, not generation
When we started Moddy, we assumed the interesting work would be the modeling itself — building the infrastructure for creators to train and publish specialized predictive models. That's a real engineering effort, and we wrote about why it matters and what it takes in another post. But it's a bounded problem. You build it, you scale it, you maintain it. Curation is the open-ended one. Our platform now has:
- 158 labs (a lab is a creator's workspace, kind of like a YouTube channel)
- 677 models built across them
- 6,404 test builds and 1,236 training runs during model development
- 480 published models
- 249 currently producing picks — the rest are paused because their sport is off-season
Those 249 active models produced 104,092 picks last month: roughly 3,500 picks every day, or about 418 per model over the month. That's the universe a bettor would have to sort through if we just dumped everything in front of them.
Nobody is going to do that. So we don't ask them to. Top Picks — the daily list at the top of the Moddy app — is our answer to that problem: a curated shortlist of the strongest plays from the strongest models, refreshed every day.
The question is how.
Why we didn't build an algorithm
The obvious approach is a scoring formula. Take each pick, weight it by the model's confidence, multiply by some function of recent ROI and win rate, throw in a recency decay, sort descending, ship.
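A minimal sketch of such a formula, for illustration only. The field names, the product used for "some function of recent ROI and win rate", and the 12-hour half-life are all assumptions made for this example, not Moddy's actual schema or weights.

```python
from dataclasses import dataclass


@dataclass
class Pick:
    model_id: str
    confidence: float    # model's confidence in this pick, 0..1 (assumed scale)
    roi_30d: float       # model's recent ROI as a fraction, e.g. 0.05 for 5%
    win_rate_30d: float  # model's recent win rate, 0..1
    age_hours: float     # hours since the pick was generated


def naive_score(pick: Pick, half_life_hours: float = 12.0) -> float:
    # "Some function of recent ROI and win rate": here, a simple product (assumed).
    performance = max(0.0, 1.0 + pick.roi_30d) * pick.win_rate_30d
    # Recency decay: the score halves every `half_life_hours`.
    decay = 0.5 ** (pick.age_hours / half_life_hours)
    return pick.confidence * performance * decay


def rank(picks: list[Pick]) -> list[Pick]:
    # Sort descending by score, then ship.
    return sorted(picks, key=naive_score, reverse=True)


# Hypothetical picks from two different niches.
picks = [
    Pick("mlb-hits", confidence=0.9, roi_30d=0.05, win_rate_30d=0.55, age_hours=24),
    Pick("nba-totals", confidence=0.7, roi_30d=0.05, win_rate_30d=0.55, age_hours=0),
]
ranked = rank(picks)
```

Here the fresher NBA pick outranks the more confident but older MLB pick. The formula puts both on a single scale even though they come from different niches.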
We tried versions of that. They fall apart for a few reasons:
Specialization makes ranking harder, not easier. Every Moddy model is built for one specific (sport × outcome) combination — an NBA totals model, an MLB batter hits model, an NFL spread model. That means you can't directly compare an MLB hits model's track record to an NBA assists model's; they're solving different problems with different data. A pick's strength is partly a function of how it stacks up within its own niche, and competitive landscapes vary wildly from niche to niche. A rigid scoring formula either flattens those niches together or tries to encode every niche's nuance through ever-more-complicated weight schemes that become unmaintainable.
Performance is contextual and recent. A model running hot two months ago can be cold today. A model with mediocre 30-day numbers can have rebuilt itself in the last week. A formula that weights the 30-day window (d30) evenly with the 90-day window (d90) can't tell a model that is recovering from one that is sliding: the two can blend to similar numbers while pointing in opposite directions.
Sample size matters but isn't binary. A model with 50 picks and a 15% ROI is exciting. It's also probably an artifact of variance. A model with 800 picks and a 4% ROI is less exciting and more real. Scoring formulas struggle to reason about this calibration the way an experienced bettor instinctively can.
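One rough way to see the difference between those two records is to put an uncertainty band around each ROI. The sketch below uses a normal approximation and assumes each pick's return has a standard deviation of about one unit staked, which is only a ballpark for flat bets near even money. The function name and defaults are hypothetical.

```python
import math


def roi_interval(roi: float, n_picks: int, per_pick_std: float = 1.0, z: float = 1.96):
    """Approximate 95% interval for ROI under a normal approximation.

    per_pick_std is an assumed standard deviation of a single pick's return,
    in units staked; real values depend on the odds being bet.
    """
    half_width = z * per_pick_std / math.sqrt(n_picks)
    return roi - half_width, roi + half_width


small_sample = roi_interval(0.15, 50)   # 50 picks at 15% ROI
large_sample = roi_interval(0.04, 800)  # 800 picks at 4% ROI

small_width = small_sample[1] - small_sample[0]
large_width = large_sample[1] - large_sample[0]
```

Under these assumptions the 50-pick interval is four times as wide as the 800-pick one, because the width shrinks with the square root of the sample size and 800 / 50 = 16. The exciting number is also the far less certain one.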
We needed something that could think about model performance the way a sharp room would — multiple specialists, each looking at the same data through a different lens, debating which models are actually putting up the best picks today. So we built one.
The pundit room
Serious bettors don't bet in isolation. They form sharp rooms: small groups of analysts with different specialties who challenge each other's plays before money goes down. A momentum guy, a contrarian, a sport specialist, a sample-size pedant. No single one of them is right about everything. The room is right more often than any one of them.
Moddy's curator system is an AI version of this. We call the agents pundits, and each one has a distinct betting temperament that shapes how they read the data.
When it's time to update a list, each pundit independently evaluates the universe of currently producing models against that list's theme. Top Picks asks: which models are putting up the best plays across all sports and all bet types right now? Each pundit produces a take, with reasoning, before any of them see what the others have said.
Meet the pundits
- Steady Hand — patient and conservative. Weights d60 and d90 trends; treats d30 as supplementary. A single bad month doesn't earn a swap if the longer windows still look good. Believes "churn erodes user trust" and that stability is a feature, not a limitation. Considers follower counts because removing a popular model has real user impact.
- Trend Hawk — momentum follower. d30 is the primary signal. Comfortable with turnover when the recent data supports it. A model whose d30 ROI delta is sliding gets a hard look even if d60 and d90 still look fine. "The trend is the signal."
- High Roller — concentrated and edge-obsessed. Prefers tight lists of three to five models. Treats edge as the primary signal because ROI can be manufactured by variance, but edge has to be earned. Deeply skeptical of small sample sizes — a hot model with 50 picks is "a kid with a fake ID."
- Nothing-to-Lose — a sharp with a chip on his shoulder. Trusts ROI and edge above all else. Treats win rate as a vanity metric because it ignores the price of the lines. Anchors on d90 because that's who a model really is. No second chances for cold models.
- Sport Specialist — domain expert. Weights seasonal context heavily. Skeptical of trends from off-season models because the data is stale or thin. Knows that NBA early-season variance is different from NBA mid-season stability, and that a great preseason NFL model probably isn't actually great.
These aren't just labels. The temperament is encoded into how each pundit reasons — what they look at, what they discount, what kind of evidence they require to recommend a change. That's the whole point: get five honest, different perspectives on the same data, then resolve the disagreements deliberately.
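As a toy numeric stand-in for that idea, the sketch below reduces two temperaments to weights over the d30, d60 and d90 windows (read here as trailing 30-, 60- and 90-day ROI). The real pundits are reasoning agents, not weight vectors. Every weight, threshold and field name below is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Temperament:
    name: str
    window_weights: dict[str, float]  # hypothetical weights over "d30", "d60", "d90"
    min_picks: int                    # hypothetical evidence bar before forming a view


@dataclass
class ModelStats:
    model_id: str
    roi: dict[str, float]  # ROI per window, as fractions
    picks_d90: int         # picks graded in the 90-day window


def pundit_view(t: Temperament, m: ModelStats):
    # Abstain when the sample is below this pundit's evidence bar.
    if m.picks_d90 < t.min_picks:
        return None
    # Otherwise blend the windows according to temperament.
    return sum(w * m.roi[window] for window, w in t.window_weights.items())


steady_hand = Temperament("Steady Hand", {"d30": 0.2, "d60": 0.4, "d90": 0.4}, min_picks=100)
trend_hawk = Temperament("Trend Hawk", {"d30": 0.7, "d60": 0.2, "d90": 0.1}, min_picks=100)

# A model that has cooled off recently but still looks fine over longer windows.
cooling = ModelStats("mdl.X", roi={"d30": -0.02, "d60": 0.03, "d90": 0.05}, picks_d90=400)

steady_view = pundit_view(steady_hand, cooling)
hawk_view = pundit_view(trend_hawk, cooling)
```

With these made-up numbers, Steady Hand still reads the model as positive while Trend Hawk reads it as negative. The same data yields two honest but different takes.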
The arbiter
Resolution doesn't happen by averaging votes. It happens through another agent: the arbiter.
After the pundits submit their takes, the arbiter reads all of them and produces a final ruling — which models should drive the list, and which shouldn't. The arbiter doesn't pick a majority. It weighs the strength of each pundit's reasoning against the data they reference.
Two of our arbiter temperaments illustrate the range:
- Balanced — gives every pundit a fair hearing. "A minority opinion backed by strong data beats a majority opinion backed by vibes." Produces concise rulings that explicitly cite specific pundit arguments: "Siding with Trend Hawk on mdl.X removal — the d30 decline is confirmed by edge erosion, which Steady Hand's d60 argument doesn't address."
- Strict — a skeptic. The burden of proof lies with those proposing change. A single dissenting pundit with solid reasoning is enough to make the arbiter hesitate. Defaults to leaving the list alone when the evidence is mixed.
Making the arbiter its own agent, with its own personality, is a deliberate architectural decision. It means we can tune system behavior (more reactive vs. more conservative) without touching the pundits at all.
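The sketch below shows the general shape of that separation with a deliberately simplified stand-in. Each take carries a hypothetical "evidence" score, and the arbiter is just a threshold on evidence-weighted support. The actual arbiter is an agent that reads the pundits' reasoning; the numeric scores, thresholds and the `rule` function are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Take:
    pundit: str
    proposes_change: bool
    evidence: float  # hypothetical 0..1 score for how well the data backs the take


def rule(takes: list[Take], change_threshold: float,
         dissent_veto: Optional[float] = None) -> str:
    support = sum(t.evidence for t in takes if t.proposes_change)
    oppose = sum(t.evidence for t in takes if not t.proposes_change)
    # A strict arbiter hesitates if any single dissenter has solid reasoning.
    if dissent_veto is not None and any(
        (not t.proposes_change) and t.evidence >= dissent_veto for t in takes
    ):
        return "keep"
    # Weigh evidence, not head count.
    return "change" if support - oppose >= change_threshold else "keep"


takes = [
    Take("Trend Hawk", proposes_change=True, evidence=0.9),
    Take("Steady Hand", proposes_change=False, evidence=0.3),
    Take("High Roller", proposes_change=False, evidence=0.2),
]

balanced_ruling = rule(takes, change_threshold=0.2)
strict_ruling = rule(takes, change_threshold=0.5, dissent_veto=0.7)
majority_says_change = sum(t.proposes_change for t in takes) > len(takes) / 2
```

The same three takes go in both times. Only the arbiter's parameters differ. A head count would keep the list unchanged, the balanced setting sides with the well-supported minority, and the strict setting leaves the list alone.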
From shortlist to ranked picks
Once the arbiter rules, you have today's list of models. Now you need to turn that into the actual ranked picks the bettor sees in the app.
Each pick from those models is scored on two things: how confident the model is in the pick, and how that model has actually performed on similar picks recently. A high-confidence pick from a proven model rises to the top. A very strong pick from a newer model can still outrank a weaker pick from an established one. We diversify so a single hot model doesn't dominate the visible list, and we tag the strongest signals with a 🔥 to mark Strong picks — plays where the model thinks it's found a significant edge against the line.
That's the Top Picks you actually see.
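A minimal sketch of that last step, assuming an equal blend of the two signals, a per-model cap as the diversification rule, and a fixed edge threshold for the Strong tag. None of those specifics come from the production system. They are placeholders to show the shape of the logic.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Pick:
    pick_id: str
    model_id: str
    confidence: float      # model's confidence in the pick, 0..1 (assumed scale)
    recent_similar: float  # model's recent performance on similar picks, normalized 0..1
    edge: float            # model's estimated edge against the line, as a fraction


def top_picks(picks, limit=10, per_model_cap=2, strong_edge=0.05, w_conf=0.5):
    def score(p: Pick) -> float:
        # Blend the two signals; the 50/50 weighting is an assumption.
        return w_conf * p.confidence + (1 - w_conf) * p.recent_similar

    shown_per_model = defaultdict(int)
    result = []
    for p in sorted(picks, key=score, reverse=True):
        # Diversify: skip a model once it already holds its share of the list.
        if shown_per_model[p.model_id] >= per_model_cap:
            continue
        shown_per_model[p.model_id] += 1
        is_strong = p.edge >= strong_edge  # would be shown with the Strong tag
        result.append((p.pick_id, is_strong))
        if len(result) == limit:
            break
    return result


picks = [
    Pick("a1", "model-a", confidence=0.9, recent_similar=0.8, edge=0.06),
    Pick("a2", "model-a", confidence=0.8, recent_similar=0.8, edge=0.03),
    Pick("a3", "model-a", confidence=0.7, recent_similar=0.8, edge=0.02),
    Pick("b1", "model-b", confidence=0.6, recent_similar=0.5, edge=0.01),
]
visible = top_picks(picks, limit=3)
```

In this example the hot model's third pick scores higher than the other model's only pick, but the cap keeps it off the visible list so one model can't fill every slot.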
Top Picks is one of many
The pundit-and-arbiter architecture isn't just for Top Picks. The same system produces themed lists too — sport-specific, momentum-driven, high-conviction concentrated picks — each with its own thesis the pundits evaluate against. A single scoring formula tuned for "best across all sports and outcomes" can't also be tuned for "hot-streak momentum plays." The pundit room can. Each list inherits the same room of analysts; the arbiter decides what the right answer looks like for that list's specific theme.
For bettors, this means Top Picks is the output of a system explicitly designed for the signal-to-noise problem of a large, open model marketplace. You don't have to evaluate which models to trust. That's our job, and we do it every day.
For creators, the pundit room doesn't care about your follower count or how long you've been on the platform. If your model is putting up the best picks against today's data, the pundits will notice and the arbiter will rule accordingly. A great new model can outrank a long-established one. The room is honest about that, even when it's inconvenient.
When we say "the best picks from the best models, today" — that's not a tagline. It's an architecture.