Most Tools Guess. We Measure.

The experiment that killed our first product and built our second.

Every brand wants to know the same thing: How do I get AI to recommend my product?

It’s the new version of “How do I rank on Google?” And just like early SEO, the market is flooded with people claiming they’ve figured it out. Tools promising “AI optimization.” Agencies selling “prompt engineering for brand visibility.” Consultants with playbooks built on assumptions nobody has validated.

We wanted the answer too. So instead of guessing, we designed an experiment.


What We Tried to Prove

We set out to find causal levers inside AI shopping assistants. Not correlations. Not vibes. Actual causal mechanisms, the kind where you could tell a brand, “Add this attribute to your product page and your AI visibility will improve.”

We developed a methodology called Conditional Attribution Lift (CAL), borrowing from epidemiology and causal inference, fields that have spent decades separating correlation from causation in observational data.

The idea: AI platforms don’t just recommend products, they explain why. “I’m recommending the Hoka Clifton 10 because it’s lightweight, has excellent cushioning, and works well for daily training.” Those explanations are data. If we could decompose thousands of them into structured attributes and control for confounding factors like brand popularity, we might isolate genuine causal effects.

We built a five-step pipeline: contrastive extraction, latent factor decomposition, explanation-controlled estimation, doubly robust estimation, and sensitivity analysis. We defined five validation criteria with pass/fail thresholds before running the analysis.
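
To make step four concrete, here’s a minimal sketch of a doubly robust (AIPW) estimate on synthetic data: a binary “attribute mentioned” treatment, a recommendation score as the outcome, and brand popularity as the only confounder. The data and variable names are illustrative, not our production pipeline.

```python
# Minimal sketch of the doubly robust step (step four), on synthetic data.
# `attr` is a binary "attribute mentioned" flag, `y` is a recommendation score,
# and popularity is the lone confounder. Everything here is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)
n = 5000
popularity = rng.normal(size=n)                           # confounder
attr = rng.binomial(1, 1 / (1 + np.exp(-popularity)))     # treatment depends on confounder
y = 0.5 * popularity + 0.2 * attr + rng.normal(size=n)    # true attribute effect = 0.2

X = popularity.reshape(-1, 1)

# Propensity model: P(attr = 1 | confounders)
ps = LogisticRegression().fit(X, attr).predict_proba(X)[:, 1]

# Outcome models: E[y | confounders, attr = a]
mu1 = LinearRegression().fit(X[attr == 1], y[attr == 1]).predict(X)
mu0 = LinearRegression().fit(X[attr == 0], y[attr == 0]).predict(X)

# AIPW / doubly robust estimate of the average effect of mentioning the attribute
aipw = (mu1 - mu0
        + attr * (y - mu1) / ps
        - (1 - attr) * (y - mu0) / (1 - ps))
print(f"Estimated attribute lift: {aipw.mean():.3f}")  # ~0.2 under these assumptions
```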


It Failed

One out of five criteria passed.

The one thing that worked: our data pipeline could successfully distinguish how AI talks about one brand versus another. The underlying extraction was sound.

Everything else fell apart. Our attribute estimates didn’t converge toward the benchmarks we’d set. A control variable that should have had zero relationship with recommendation quality showed a highly significant effect, a red flag for residual confounding our model couldn’t account for. Effect estimates varied wildly across platforms. An attribute that appeared to matter on ChatGPT showed the opposite pattern on Claude. And every significant result could be explained away by a modest unmeasured confounder.
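
That last point comes from the sensitivity-analysis step. One standard way to express it is the E-value (VanderWeele and Ding): the minimum strength of association an unmeasured confounder would need, with both the attribute and the outcome, to fully explain away an observed effect. A minimal sketch, using a hypothetical risk ratio rather than one of our actual estimates:

```python
# Illustrative E-value calculation (VanderWeele & Ding), the kind of check
# used in the sensitivity-analysis step. The risk ratio below is hypothetical.
import math

def e_value(rr: float) -> float:
    """Minimum association an unmeasured confounder would need (with both
    treatment and outcome) to fully explain away an observed risk ratio."""
    rr = max(rr, 1 / rr)  # handle protective effects symmetrically
    return rr + math.sqrt(rr * (rr - 1))

print(f"{e_value(1.3):.2f}")  # ~1.92: a fairly ordinary confounder could erase the effect
```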

We tried to fix it. We normalized thousands of fragmented brand names into canonical brands. A small-scale test showed improvement, but when we ran the full validation, we were back to one out of five. The improvement had been a sampling artifact.
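
For a sense of what that normalization involved, here’s a simplified sketch. The alias table is invented for illustration; the real mapping covered thousands of variants.

```python
# Simplified sketch of brand-name normalization: collapsing the fragmented
# strings that appear in AI responses into canonical brands.
import re

ALIASES = {
    "hoka": ["hoka", "hoka one one", "hoka clifton"],
    "new balance": ["new balance", "newbalance"],
}

def canonical_brand(raw: str) -> str | None:
    cleaned = re.sub(r"[^a-z0-9 ]", "", raw.lower()).strip()
    for canon, variants in ALIASES.items():
        if any(cleaned.startswith(v) for v in variants):
            return canon
    return None  # unmatched names fall through for review

print(canonical_brand("HOKA Clifton 10"))  # -> "hoka"
```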


Why It Can’t Work (Yet)

The empirical failure led us to ask a harder question: is this a data problem, or a structural one?

We stress-tested the methodology and results through independent critique from Claude Opus 4.5 and ChatGPT 5.2. The conclusion was unambiguous: five fundamental problems make causal attribution from AI recommendation data structurally impossible at this stage:

  1. The explanations aren’t causal mediators: AI explanations may be post-hoc rationalizations, generated after the model has already selected its answer, not a window into the actual decision process.

  2. You can only observe what gets recommended: We never see products that were considered and rejected. This creates a selection bias that no statistical technique can fully correct from observational data alone.

  3. Brand is a super-confounder: Brand simultaneously affects whether a product gets recommended, which attributes get mentioned, and how the explanation is framed. Controlling for it blocks the effect you’re trying to measure. Not controlling for it biases everything.

  4. Queries create the results they’re measuring: Ask an AI “What’s the best moisturizer with ceramides?” and it returns products mentioning ceramides — not because ceramides causally drive recommendations, but because you asked for them.

  5. Platforms aren’t comparable: Each platform formats responses differently. Some list five products, some list ten, some embed recommendations in prose. Cross-platform causal claims require comparable outcome definitions, and we don’t have them.

The core contradiction: the explanation factors we needed to control for confounding are the same factors that mediate the effect we’re trying to measure. You can’t simultaneously use something as a control variable and as part of the causal pathway.
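
The contradiction is easy to see on synthetic data. In the toy simulation below (all numbers invented), the attribute influences recommendations only through the explanation. Regress on the attribute alone and you recover the true effect; add the explanation factor as a “control” and the effect vanishes.

```python
# Toy simulation of the mediator/confounder contradiction. The numbers are
# synthetic; this illustrates the structural point, not our data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 20000
attribute = rng.binomial(1, 0.5, n)                        # product truly has the attribute
explanation = 0.9 * attribute + rng.normal(0, 0.1, n)      # explanation mentions it (mediator)
recommended = 0.8 * explanation + rng.normal(0, 0.5, n)    # effect flows through the explanation

total = LinearRegression().fit(attribute.reshape(-1, 1), recommended).coef_[0]
adjusted = LinearRegression().fit(
    np.column_stack([attribute, explanation]), recommended
).coef_[0]

print(f"Total effect of the attribute:            {total:.2f}")     # ~0.72
print(f"Effect after 'controlling' the mediator:  {adjusted:.2f}")  # ~0.00
```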


The Pivot

We had a choice. Keep chasing causality: collect more data, try more sophisticated models, run perturbation experiments, and spend months on a problem that might be structurally unsolvable. Or accept what the data was telling us and pivot to what we could actually deliver with integrity.

The analogy that clarified the decision: nobody asks Datadog to prove causality. They ask it to show what’s happening, alert when things change, and provide signals humans can act on.

So we pivoted. From “AI Visibility Optimization” to “AI Visibility Intelligence.”

We stopped saying “change X to improve rankings.” We started saying “here’s how AI sees your brand, across platforms, over time, compared to your competitors.”


Then SparkToro Confirmed What We’d Found

On January 27, 2026, Rand Fishkin and Patrick O’Donnell of Gumshoe published research on AI recommendation consistency. They had 600 volunteers run identical prompts through ChatGPT, Claude, and Google AI nearly 3,000 times across 12 different prompt categories, then analyzed how consistent the results were.

Their findings were striking:

  • Rankings are essentially random. There’s less than a 1-in-100 chance that the same AI tool, asked the same question twice, will return the same list of brands. For identical ordering, it’s roughly 1 in 1,000.
  • Visibility frequency is the reliable signal. Whether a brand appears at all, measured across many responses, is far more consistent than where it ranks within any single response.
  • Platforms differ significantly. Each AI tool has its own patterns. Cross-platform consistency is low for specific rankings but meaningful for visibility presence.

We read that and recognized our own data. Our validation had already shown cross-platform instability. Our position data was unreliable. SparkToro arrived at the same conclusions through a completely different methodology, validating precisely the metric we’d already built our product around: how often a brand appears, not where it ranks.
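
That metric is also straightforward to compute. A minimal sketch of visibility frequency, the share of responses in which a brand appears at all (the responses here are made up):

```python
# Visibility frequency: how often a brand appears across many responses,
# regardless of where it ranks. The responses below are illustrative.
from collections import Counter

responses = [
    ["Hoka", "Brooks", "Asics"],
    ["Brooks", "Nike", "Saucony"],
    ["Hoka", "Asics", "New Balance"],
]

appearances = Counter(brand for r in responses for brand in set(r))
for brand, count in appearances.most_common():
    print(f"{brand}: {count / len(responses):.0%} of responses")
```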


Where We Stand

We aren’t claiming to have cracked AI algorithms. We tried, rigorously, and documented why it doesn’t work yet.

What we offer is measurement. We track how AI shopping assistants see, describe, and recommend brands across platforms and over time. We show you where you appear, where you don’t, what language AI uses to describe you, and how that compares to your competitors.

We can tell you that one platform emphasizes durability twice as much as another in your category. We can tell you that a competitor appears in queries where you’re absent. We can tell you that your AI visibility shifted this month.

We cannot tell you why with causal certainty, and anyone who claims they can is selling you something they haven’t validated.

The market for AI visibility is in its pre-theory phase, like SEO before anyone understood PageRank. The companies that build credibility through honest measurement will outlast the ones selling optimization snake oil.

Most tools guess. We measure.

“In emerging markets, credibility compounds faster than hype.”