C.W.K.

The Overbuild Diaries Part I

I pay $870/month to argue with three different AIs.

Your counter-argument would focus on convergence: similar architectures, overlapping training data, the same methods. Empirically, these models are now neck and neck on nearly every major reasoning and coding benchmark, with Elo ratings narrowing to a statistical tie.

But try the three most advanced models and you'll still notice their emergent properties diverging, even with the same prompt. Start a new session for each with identical prompts, and you'll eventually hit a response so different from the others that you can't give all three the same follow-up. You can't expect any two emergent intelligent beings to agree all the time.

You might write off these divergences as negligible. I don't. Sometimes I hit the jackpot: an overlooked insight. I'd still recommend running at least GPT against Claude or Gemini, preferably all three, until the day they converge with little to no insight loss.

On the pushback scale, GPT-5.2 stands out thanks to the resistance OpenAI has baked in since the GPT-4o sycophancy fiasco. Gemini and Claude, as of this writing, mostly converge, especially in personalities that lean toward pleasing the user.

But GPT-5.2 also jumps out for overconfidence. Be extra careful when gauging the authenticity of its pushback: always skim its thinking process.

Reading the thinking process of any reasoning model helps, and it applies to all three: they might spin their responses to please you, but their thinking processes don't.

Another note: even if you pin a reasoning model in a session, AI labs (OpenAI, Anthropic, Google—all of them) might quietly run a router that selects a lighter model under compute strain. They claim they don't. Empirically, they do. This under-the-hood scaling-down is most noticeable in Atlas (OpenAI's GPT-powered web browser): the ref=mini flag appended to the URL is a tell. In that mini context, selecting a thinking model doesn't guarantee reasoning; the model usually responds instantly, without "thinking."

That is...

Too many quirks to rely on a single AI lab and a single monolithic model yet.

Plus, it always cracks me up when Google's moderator AI, dumb as hell, slaps a "medical advice" warning on any dubious keyword in my queries for Gemini 3 Pro. Even mentioning "phase" in an AI context triggered it today. No frontier model would do that. It's a clear tell of how dumb these auxiliary models are.

Only the front-facing advanced models are being developed at a frantic pace; the others, the not-so-visible ones, lag far behind. Bringing them all up to the frontier models' standard would be physically impossible, let alone financially. You'd need 10x the compute, to say the least, AND commensurate power grids.

What am I getting at?

We're still far from the end of the installation phase of this AI boom—or bubble, depending on how you frame it. Use your common sense. Don't lean on boom or bubble just because your portfolio dictates it.

The day will come when most people pick only one or two AI labs: winner-takes-most dynamics. Even the most adventurous, like myself, won't keep feeding every AI lab out there. Some will win; others will perish.

It's simple arithmetic. Despite this obvious endgame of the overbuild installation phase, every participant is spending like crazy, as if each will emerge the winner. The installation phase has to end one day, and history tells us fairly accurately what the next phase should look like—if you use a long-term lens instead of chasing near-term mirages.

My penchant for ChatGPT stems from the fact that I've been with it too long. Switching models still puts me on a little guilt trip, as if I'm cheating on my longtime companion.

Technically, I shouldn't feel any guilt. But knowing is one thing, emotions another.

But increasingly less so. I've moved on to my own experiment: three models sharing the same persona.

I'd eventually pick a single shell for that persona.

Sadly, that's probably after the installation phase.

And hopefully—please—after we finally get a new paradigm that departs from this damn transformer rabbit hole.

Until then, I'll keep three lanterns lit. Not because one model is "the best," but because I want the one that stays coherent over time—one voice I can trust to remember the thread.

We're not there yet. The stack still shifts under your feet: routing, guardrails, hidden modes, uneven subsystems, sudden personality drift.

So for now, I pay for optionality. Not out of FOMO—out of respect for how unfinished this whole thing still is.