01
Falsifiability over confirmation
Karl Popper's falsificationism · The Logic of Scientific Discovery (1959)
What this means: Design experiments that can prove us wrong, not just confirm our beliefs.
How we practice it: We create control groups and test sites that could disprove our hypotheses. Every experiment defines in advance the outcomes that would show we were wrong.
Example: In our schema markup experiment, we created identical test sites — one with schema, one without — specifically to test whether schema actually matters for AI extraction.
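One way to make a with-schema versus without-schema comparison falsifiable is to fix, in advance, a statistical test whose result can go against the hypothesis. The sketch below is an assumption about how such a comparison might be scored, not the method used in the experiment. It applies a standard two-proportion z-test to hypothetical counts of queries where the expected information was extracted from each site. The function name, the counts, and the 0.05 threshold are all illustrative.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (z, p_value). A large p-value means the data do not
    support a difference between the two groups.
    """
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both groups are all hits or all misses: no measurable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, for illustration only (not results from the experiment).
z, p = two_proportion_z_test(hits_a=62, n_a=100, hits_b=55, n_b=100)
verdict = "difference detected" if p < 0.05 else "no detectable difference"
print(f"z={z:.2f}, p={p:.3f}: {verdict}")
```

The point of the design is that "no detectable difference" is a possible and reportable outcome, so the hypothesis that schema matters can actually fail.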
02
Hypothesis first, data second
Henri Poincaré's scientific method · Science and Hypothesis (1902)
What this means: State clear hypotheses before testing, not after seeing the data.
How we practice it: We document hypotheses, predicted outcomes, and success criteria before running experiments. No post-hoc rationalization.
Example: Before testing ChatGPT's response patterns, we predicted specific outcomes for 400 espresso machine queries. We then compared actual results against those predictions.
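A minimal sketch of what writing the hypothesis down first could look like in code. The `Prediction` structure, its field names, and the example values are hypothetical, not the format used for the study. The idea is that the prediction and its success criterion are frozen and fingerprinted before any data exists, so the criterion cannot quietly change afterwards.

```python
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass(frozen=True)
class Prediction:
    """A hypothesis recorded before data collection (hypothetical structure)."""
    hypothesis: str
    metric: str
    threshold: float  # supported only if the observed value is >= threshold

    def fingerprint(self) -> str:
        # SHA-256 of the content, so any later edit to the prediction is detectable.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def evaluate(prediction: Prediction, observed: float) -> str:
    """Compare an observed value against the pre-registered threshold."""
    return "supported" if observed >= prediction.threshold else "refuted"


# Illustrative values only; the real predictions are not reproduced here.
prediction = Prediction(
    hypothesis="Most responses name the same product first",
    metric="share_of_responses_with_same_first_product",
    threshold=0.5,
)
print("registered:", prediction.fingerprint()[:12])
print("result:", evaluate(prediction, observed=0.4))
```

Publishing the fingerprint before the run is one simple way to show that the success criterion predates the results.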
03
Reproducible methods
Scientific method · Standard scientific practice
What this means: Document everything so others can replicate our work.
How we practice it: We publish detailed methodologies, share test prompts, and provide step-by-step instructions for replicating our experiments.
Example: Our schema markup study includes the exact ChatGPT queries used, the test site URLs, and the methodology so anyone can replicate the experiment.
04
Controlled variables
Experimental design · Statistical experimental design
What this means: Isolate what we're testing — change one variable at a time.
How we practice it: We control for confounding variables and test single hypotheses with proper controls. We don't change multiple things at once.
Example: When testing schema markup impact, we kept content, design, and structure identical — only the schema markup differed between test sites.
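The "only the schema markup differed" condition can be checked mechanically. The sketch below assumes the schema is embedded as JSON-LD in `<script type="application/ld+json">` blocks. The page does not state the markup format, so treat that as an assumption. It strips those blocks from both pages and compares what remains. The function names and sample HTML are illustrative.

```python
import re

# Matches JSON-LD blocks: <script type="application/ld+json"> ... </script>
JSON_LD = re.compile(
    r"""<script[^>]*type=["']application/ld\+json["'][^>]*>.*?</script>""",
    re.IGNORECASE | re.DOTALL,
)


def strip_schema(html: str) -> str:
    """Remove JSON-LD blocks and normalize whitespace."""
    without_schema = JSON_LD.sub("", html)
    # Collapse whitespace so removing a block does not leave a spurious difference.
    return " ".join(without_schema.split())


def differ_only_in_schema(page_a: str, page_b: str) -> bool:
    """True if the two pages are identical once JSON-LD blocks are removed."""
    return strip_schema(page_a) == strip_schema(page_b)


# Illustrative pages, not the real test sites.
with_schema = """<html><head><title>Test</title>
<script type="application/ld+json">{"@type": "Product", "name": "Example"}</script>
</head><body><p>Same content</p></body></html>"""

control = """<html><head><title>Test</title>
</head><body><p>Same content</p></body></html>"""

print(differ_only_in_schema(with_schema, control))
```

A regex is enough for a sanity check like this. A stricter comparison would parse both documents with an HTML parser.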
05
Statistical significance
Statistical inference · Statistical hypothesis testing
What this means: Use sample sizes large enough to draw meaningful conclusions.
How we practice it: We test with appropriate sample sizes and avoid drawing conclusions from a handful of cases. We measure consistency, not just averages.
Example: Our espresso machine ranking study tested 400 identical queries to measure ChatGPT's response variability — a sample size large enough to detect patterns reliably.
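To illustrate why the number of repeated queries matters, the sketch below measures how often the most common answer appears in a set of responses and attaches an approximate 95% Wilson score interval to that share. The response data is made up for illustration, and the function names are hypothetical. Only the 400-query sample size comes from the study described above.

```python
from collections import Counter
import math


def top_answer_share(responses):
    """Return the most common answer and the fraction of responses that gave it."""
    answer, count = Counter(responses).most_common(1)[0]
    return answer, count / len(responses)


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z = 1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Made-up responses: the same 65% agreement rate at two sample sizes.
small = ["Machine A"] * 13 + ["Machine B"] * 7
large = ["Machine A"] * 260 + ["Machine B"] * 140

for responses in (small, large):
    answer, share = top_answer_share(responses)
    low, high = wilson_interval(round(share * len(responses)), len(responses))
    print(f"n={len(responses)}: {answer} in {share:.0%} of responses, "
          f"interval {low:.3f} to {high:.3f}")
```

With the same observed share, the interval at 400 responses is much narrower than at 20, which is the practical meaning of a sample large enough to detect patterns.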
06
Transparent limitations
Richard Feynman's scientific honesty · Cargo Cult Science (1974)
What this means: Acknowledge what we don't know and what our methods can't measure.
How we practice it: We list every limitation, caveat, and uncertainty in our findings. We don't oversell results or hide weaknesses.
Example: When publishing on schema markup, we explicitly noted the AI Overviews period and the limited timeframe of our test, even though these caveats weakened the headline finding.
07
Quantifiable results
Measurement principle · Scientific measurement standards
What this means: Define success metrics upfront and measure them objectively.
How we practice it: We establish clear, measurable criteria before running experiments. No subjective assessments — only quantifiable outcomes.
Example: Our schema experiment measured changes in three specific outcomes: information extraction, retrieval of additional structured data, and source attribution.
08
No cherry-picking data
Statistical integrity · Nature: How to fight cherry-picking
What this means: Report all results, not just the ones that support our hypotheses.
How we practice it: We show complete responses, including ones that don't support our hypotheses. We report null results and unexpected findings.
Example: In our ranking experiment, we published all 400 query results — including cases where ChatGPT was inconsistent, not just the ones that supported our hypothesis.
09
Iterate based on evidence
Poincaré's convention · The Value of Science (1905)
What this means: Update our beliefs when data contradicts them, not the other way around.
How we practice it: When experiments contradict our assumptions, we change our recommendations and methodology — not our interpretation of the data.
Example: Our finding that ChatGPT does not behave like a ranking system led us to revise our approach and move away from traditional ranking metrics.
10
Subject to peer review
Community validation · Nature on the peer review process
What this means: Welcome scrutiny, feedback, and challenges from the community.
How we practice it: We publish methodologies publicly and invite others to test, challenge, and improve our work. We respond to criticism constructively.
Example: This page itself is an invitation for peer review — we share our methodology openly so it can be examined and pushed back on.