The Brave of AI Search: Why Full Opt-In Conversation Data Wins

Every AI search analytics vendor faces the same uncomfortable question: where does your conversation data come from? The honest answers across the industry range from "we make prompts up and replay them" to "we buy clickstream from brokers and prefer you don't ask." There is a better answer, and a browser company already proved it at scale. Brave built an independent search index almost entirely on data that users chose to contribute - anonymized, sanitized, and aggregated so aggressively that Brave itself cannot reconstruct anyone's session. We think that model is the right one for AI search too. This post explains Brave's Web Discovery Project, the analogy to what we do at Aiso, and why full opt-in isn't just the ethical choice - it's the more durable and more honest dataset.

The 30-second version

Brave's Web Discovery Project is strictly opt-in: contributors share searches and page visits, sanitized client-side, sent with no user ID, IP hidden, and unlocked server-side only once enough independent people saw the same thing.
That consented data stream powers Brave Search's independent index - no Google, no Bing underneath.
Aiso applies the same principle to AI search: millions of real ChatGPT, Gemini, and Claude conversations from a panel of 5M+ unique IPs, each conversation voluntarily shared in exchange for clear value, anonymized before analysis, and reported only in aggregate.
Opt-in data wins on three axes: legitimacy (survives regulation and platform crackdowns), quality (a panel you can characterize honestly), and durability (a renewable asset, not a gray-area dependency).

What Brave actually built

In October 2021, Brave made its own search engine the default in the Brave browser and shipped something quieter alongside it: the Web Discovery Project. The pitch to users was simple. Brave Search wanted an independent index - results served from its own crawl-and-rank pipeline rather than white-labeled Google or Bing - and the fastest way to learn which pages matter is to learn from real browsing. But Brave is a privacy company, so the usual approach (silently hoover up everyone's history) was off the table. Instead, the Web Discovery Project is off by default. Users who opt in contribute search queries, result clicks, visited URLs, time on page, and some page metadata.

The interesting part is everything Brave does to make that contribution unlinkable to a person:

Sanitization happens in the browser. Queries longer than 50 characters or 7 words are dropped. So is anything containing long numbers, emails, or credentials. Pages behind a login, pages flagged no-index, and URLs that look private (hashes, odd ports, IP addresses) never leave the device. Brave even fetches each candidate page a second time anonymously and compares; if the logged-out version differs, the page is treated as private and discarded.
No user identifiers, ever. Records carry no UID. Messages are sent at randomized intervals in separate requests with one-time encryption keys, so they can't be correlated by timing. A proxy network hides the sender's IP.
A URL needs a crowd before Brave can even read it. Under the STAR protocol (published at ACM CCS 2022), contributions arrive encrypted and only become decryptable once enough independent users have submitted the same URL. That's k-anonymity enforced by cryptography rather than by a privacy policy: a page one person visited stays unreadable, full stop.
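To make the client-side rules above concrete, here is a minimal Python sketch of a sanitization filter. It is not Brave's code: the 50-character and 7-word limits come from the description above, while the function names, the regular expressions, and the thresholds for "long number" and "hash-like" are our own assumptions for illustration. It also skips the login, no-index, and double-fetch checks.

```python
import re
from urllib.parse import urlsplit

MAX_CHARS = 50   # from the description above
MAX_WORDS = 7    # from the description above

# Assumptions for illustration: what counts as "long" or "hash-like".
LONG_NUMBER = re.compile(r"\d{7,}")
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
IPV4_HOST = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")
HASH_LIKE = re.compile(r"[0-9a-fA-F]{16,}")


def query_is_safe(query: str) -> bool:
    """Return True only if the query passes every drop rule."""
    if len(query) > MAX_CHARS or len(query.split()) > MAX_WORDS:
        return False
    if LONG_NUMBER.search(query) or EMAIL.search(query):
        return False
    return True


def url_is_safe(url: str) -> bool:
    """Return True only if the URL does not look private."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    try:
        port = parts.port
    except ValueError:  # malformed port: treat as suspicious
        return False
    if IPV4_HOST.match(host):               # raw IP address
        return False
    if port not in (None, 80, 443):         # odd port
        return False
    if parts.username or parts.password:    # embedded credentials
        return False
    if HASH_LIKE.search(parts.path) or HASH_LIKE.search(parts.query):
        return False
    return True
```

The design choice worth noticing is that both functions fail closed: anything that trips a rule is dropped on the device, so a false positive costs one data point rather than someone's privacy.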

The net effect, in Brave's own framing, is that reconstructing a user's session is not merely against policy - it is not technically possible. And the code is open source, so you don't have to take their word for it.

User identifiers in a Web Discovery record: 0
Independent visitors required before a URL is readable: k (a quorum, enforced by STAR)
Unique IPs in Aiso's opt-in conversation panel: 5M+
Countries represented in the panel: 120+

The analogy: conversations are the new clickstream

Swap "search index" for "AI search visibility" and the problem is identical. To know how AI assistants actually answer real people - which brands they recommend, which sources they cite, how answers shift week to week - you need real conversations, at scale, over time. The question is the same one Brave faced: how do you get that data without spying on anyone?

Our answer is the same as theirs: ask. Aiso's panel exists because users opt in, knowingly, with a clear value exchange - they get free access to premium AI models and features, and in return they agree to share their conversations. No silent SDK buried in a flashlight app, no resold extension telemetry, no scraping of pages people never meant to publish at scale. Conversations are anonymized before analysis, and what brands see is the aggregate: how a category is talked about across millions of real prompts, never an individual's chat log. We also publish exactly who our panel is and where it's skewed - sample composition, geography, intent mix, and the limitations - because a consented panel is the only kind you can afford to describe honestly.

Two pipelines, one principle: consent in, aggregate out

How Brave builds an index and how Aiso builds AI search insight - step by step

Brave · Web Discovery Project

1. Opt in: User flips on the Web Discovery Project in Brave. Off by default.
2. Collect in the browser: Searches, result clicks, page visits, time on page - gathered client-side.
3. Sanitize client-side: PII, long queries, private and authenticated pages discarded before sending.
4. Anonymize transport: No UID, IP hidden by a proxy network, randomized timing, one-time keys.
5. Quorum aggregation: STAR k-anonymity: a URL unlocks only after enough independent visitors.
6. Independent index: Brave Search answers queries without Google or Bing underneath.

Aiso · Opt-in conversation panel

1. Opt in: Users get free premium AI model access and knowingly share conversations in return.
2. Collect conversations: Real prompts and answers across ChatGPT, Gemini, and Claude.
3. Anonymize: Conversations are stripped of identity before analysis.
4. Aggregate the panel: 5M+ unique IPs across 120+ countries, characterized openly.
5. Publish the panel's shape: Demographics, skews, and limitations documented for anyone to read.
6. Visibility insights: Brands see how AI assistants actually talk about their category.

In both pipelines the individual is unrecoverable by the time anyone looks at the data. The output is knowledge about the crowd.

The four ways the industry sources this data

To see why the opt-in model matters, line it up against the alternatives. Broadly, AI search and conversation intelligence data comes from four places today:

Sourcing method	Consent	Realism	Durability risk
Synthetic prompt replays (tool fires prompts at APIs)	No humans involved, so no consent question.	Low-to-medium. You see what a model says to prompts a marketer wrote, not what real people ask. High run-to-run variance.	Low legal risk, but answers can diverge from what logged-in users with history actually see.
Purchased clickstream / data-broker panels	Murky. Consent is often buried in a free app or extension's terms; users rarely know their browsing is resold.	High - it is real behavior.	High. Broker datasets get cut off, re-anonymization scandals recur, and regulators are actively closing in.
Scraped shared-conversation pages	None from the user. People who share a link rarely expect bulk harvesting.	Real conversations, but a biased slice - only what people chose to publish.	High. Platforms have repeatedly de-indexed or killed shared-link corpora overnight.
Full opt-in panel (the Brave / Aiso model)	Explicit. Users knowingly trade their conversations for clear value, and can leave.	High - real people, real prompts, longitudinal.	Low and durable. The panel is a consented, renewable asset you can characterize honestly.

Synthetic replays have a real place - we use controlled replays ourselves when a question demands a fixed experiment, and we've written about how much run-to-run variance they carry. But replays answer "what does the model say to my prompt," not "what are real people actually asking, and what do they get back." For the second question there is no substitute for real conversations, and for real conversations there are only two paths: take them without asking, or ask. Brave demonstrated that asking scales.

Why opt-in data is also just better data

It would be easy to frame full opt-in purely as an ethics position. It is one, but it's also a quality and durability position, and that's the part most analyses miss.

1. You can describe a consented panel honestly

Because our panelists chose to join, we can study who they are and publish it - including the unflattering parts, like the fact that recruiting through Reddit ads skews the panel toward early adopters and technical users. Vendors sitting on gray-area data can't publish a demographics page without incriminating themselves, so their "market share" and "visibility" numbers arrive with no error bars and no account of sample bias. Transparency about limitations is a feature only consent can buy.

2. Consent survives crackdowns

The recent history of conversation data is a graveyard of gray-area sources: shared-conversation pages de-indexed overnight, browser-extension telemetry purged from stores, broker datasets pulled under regulatory pressure. Every product built on those sources inherited their shelf life. An opt-in panel has no such dependency - the data relationship is between us and people who said yes, and it renews itself for as long as the value exchange stays fair.

3. Aggregates are the product anyway

Brands don't need any individual's chat log; they need to know how assistants talk about their category across thousands of real prompts. Brave's insight was that if aggregate knowledge is the goal, you can make individual knowledge technically unreachable and lose nothing. The same holds for AI search: anonymized, aggregated conversations contain all the signal - which brands get recommended, in which intents, what people are actually asking, how that shifts over time - with none of the surveillance.
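As a rough illustration of aggregate-only reporting, the Python sketch below counts distinct conversations per (intent, brand) cell and suppresses any cell that falls under a minimum size. This is not a description of Aiso's production pipeline: the record shape, the opaque conversation IDs, the function name, and the `MIN_CONVERSATIONS` threshold are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

MIN_CONVERSATIONS = 50  # hypothetical suppression threshold

# Each record: (opaque conversation id, intent label, brand mentioned).
Record = Tuple[str, str, str]


def aggregate_mentions(
    records: Iterable[Record],
    min_conversations: int = MIN_CONVERSATIONS,
) -> Dict[Tuple[str, str], int]:
    """Count distinct conversations per (intent, brand), dropping small cells."""
    seen: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for conversation_id, intent, brand in records:
        seen[(intent, brand)].add(conversation_id)
    return {
        cell: len(ids)
        for cell, ids in seen.items()
        if len(ids) >= min_conversations  # small cells never reach a report
    }
```

Only the counts leave the function. The conversation IDs exist just long enough to avoid double-counting a brand mentioned twice in one chat.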

The opt-in playbook, distilled

Six principles from Brave's Web Discovery Project that define the bar for consent-first data collection

Strictly opt-in

Nothing is collected by default. Contribution is a choice the user makes in return for a clear value exchange, and one they can reverse at any time.

Discard at the source

Brave drops long or suspicious queries, emails, long numbers, and anything from authenticated or no-index pages before any of it leaves the browser.

No user identifiers

Records carry no UID and no IP. Brave's proxy network and one-time encryption keys make it technically impossible to reassemble a person's session.

Crowd before content

Under the STAR protocol, a URL is only decryptable once enough independent people have visited it - k-anonymity enforced by cryptography, not policy.

Aggregate-only output

What comes out the other end is an index and statistics about the crowd, never a dossier about a person.

Transparent by default

Brave open-sourced the Web Discovery Project code. Aiso publishes its panel demographics and limitations for anyone to inspect.

What this means if you're evaluating AI search tools

Whether or not you ever use Aiso, put this question to every vendor whose dashboard claims to know what people ask ChatGPT: "Did the humans in your dataset agree to be in it?" Then three follow-ups:

Provenance: Is the data synthetic, scraped, purchased, or opt-in? If the answer is vague ("proprietary panel partnerships"), assume the least flattering option.
Transparency: Does the vendor publish the panel's size, geography, and skews? A vendor that can't show you its sample can't defend its numbers.
Durability: If one platform changed its terms tomorrow, does the dataset survive? Consented panels do; harvested corpora historically have not.

Brave asked these questions of search a decade after everyone assumed surveillance was the price of a good index, and built a real alternative. AI search is young enough to get this right from the start. That's the project we're running: consent-first conversation data, anonymized and aggregated, with the panel's shape published for anyone to audit. The Brave of AI search isn't a slogan - it's a sourcing policy.

FAQ

Is Aiso affiliated with Brave?

No. Brave is the inspiration and the proof of concept, not a partner. We admire the Web Discovery Project's design and hold our own collection to its principles: strict opt-in, clear value exchange, anonymization before analysis, and aggregate-only reporting.

Can I see what Aiso's panel looks like?

Yes - we keep a standing, regularly refreshed write-up of our panel's demographics, geography, intent mix, and limitations, including the skews introduced by how we recruit.

Does opt-in data cover enough volume to be useful?

Brave Search answers billions of queries a year on an index built largely from opt-in contributions. Our panel spans 5M+ unique IPs and millions of conversations across ChatGPT, Gemini, and Claude in 120+ countries. Consent and scale are not in tension; they compound, because a fair value exchange keeps the panel growing.

Sources & further reading

Published June 12, 2026. Brave product details drawn from Brave's public documentation, research publications, and open-source code as of June 2026.
