๐Ÿ”Investigation & Research

What Are the Real Sources Used by ChatGPT? A Detective Investigation

Ben Tannenbaum
12 min read

Uncovering the actual sources ChatGPT uses by isolating unique phrases and tracing them back to Reddit, directories, and training data. Discover the detective techniques that reveal AI's real information sources.

Detective techniques reveal the hidden sources behind ChatGPT's responses

๐Ÿ” Investigation Revealed: Through careful detective work, we've discovered that ChatGPT's responses often contain unique phrases that can be traced back to specific Reddit comments, directory listings, and other online sources. This investigation reveals the actual mechanics behind AI-generated content.

The Detective Technique: Isolating Unique Phrases

When ChatGPT generates a response, it often includes very specific phrases or unusual word combinations that seem too distinctive to be coincidental. By isolating these sequences and searching for them across the web, we can often trace them back to the original sources that fed into the AI's training data.

The Methodology

1. Identify Unusual Phrases: Look for specific combinations of words that seem distinctive or unusual in ChatGPT's responses.

2. Search with Quotes: Use Google search with exact phrase matching (quotation marks) to find the original sources.

3. Trace the Connection: Compare the original source with ChatGPT's response to confirm the connection.
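The first two steps can be roughed out in a few lines of Python. This is only a sketch under stated assumptions: the stopword list, the three-word window, and the function names `candidate_phrases` and `exact_match_query` are illustrative choices, not part of any tool used in this investigation. It pulls three-word phrases out of a response, drops those that begin or end with a filler word, and wraps the rest in quotation marks for exact-match searching.

```python
import re

# Small illustrative stopword list (an assumption; extend it for real use).
STOPWORDS = {"a", "an", "and", "the", "for", "of", "to", "on", "in",
             "be", "should", "is", "it"}


def candidate_phrases(text: str, n: int = 3) -> list[str]:
    """Return n-word phrases that neither start nor end with a stopword."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    phrases: list[str] = []
    for i in range(len(words) - n + 1):
        gram = words[i:i + n]
        if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
            continue  # filler at the edges makes a weak search term
        phrase = " ".join(gram)
        if phrase not in phrases:  # keep first occurrence only
            phrases.append(phrase)
    return phrases


def exact_match_query(phrase: str) -> str:
    """Wrap a phrase in quotation marks for exact-phrase searching."""
    return f'"{phrase}"'


response = "The best skincare routine for men should be simple yet effective."
queries = [exact_match_query(p) for p in candidate_phrases(response)]
for q in queries:
    print(q)
```

A filter this crude still returns generic phrases, so the human step of judging which candidates are truly distinctive remains the important one.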

Case Study: The Skincare Reddit Connection

A perfect example of this detective work in action involves a ChatGPT conversation about "best skincare for men." The AI's response contained several distinctive phrases that led us directly to its source.

ChatGPT Response

User: best skincare for men
ChatGPT said:
The best skincare routine for men should be simple yet effective, focusing on cleansing, moisturizing, and sun protection. Here's a straightforward guide:

1. Cleanser (Face Wash)
Purpose: Removes dirt, oil, and pollutants from the skin.

Best Picks:

• Oily Skin: Gel or foaming cleansers (e.g., CeraVe Foaming Facial Cleanser, La Roche-Posay Effaclar).

• Dry Skin: Hydrating cleansers (e.g., CeraVe Hydrating Cleanser, Kiehl's Ultra Facial Cleanser).

• Sensitive Skin: Gentle cleansers (e.g., Vanicream Gentle Facial Cleanser).

2. Exfoliation (2-3x Per Week)
Purpose: Removes dead skin cells, preventing clogged pores and dull skin.

Best Picks:

• Chemical Exfoliants: Salicylic acid (for acne-prone skin) or glycolic acid (for brightening).

• Recommended Brands: Paula's Choice 2% BHA Liquid, The Ordinary Glycolic Acid Toner.

The actual ChatGPT conversation that started our investigation

Detective Notes: The highlighted phrases "simple yet effective" and "Exfoliation (2-3x Per Week)" with its specific formatting became our key search terms in the investigation.

The Reddit Source Found

r/beauty • Posted by u/[username]
Hey there! It's awesome that you're looking to establish a solid skincare routine. For men, a simple yet effective routine can make a noticeable difference in skin health. Here's a suggestion

Morning Routine:

Cleanser: Start with a gentle cleanser like to remove dirt and oil without stripping your skin.

Moisturizer: Follow up with a lightweight moisturizer like to keep your skin hydrated throughout the day.
*Sunscreen: Don't forget to apply a broad-spectrum sunscreen like to protect your skin from UV rays.

Evening Routine:

*Cleanser: Use the same cleanser to wash away the day's impurities.
*Exfoliation (2-3 times a week): Incorporate an exfoliating product like to help unclog pores and remove dead skin cells.
Serum (optional): Consider using a serum like to address specific skin concerns, such as hydration or fine lines.
Night Cream: Finish with a nourishing night cream like to give your skin a boost of hydration and recovery while you sleep.

Remember, consistency is key. Start with these basics and adjust based on your skin's needs. It's always a good idea to patch-test new products and introduce them one at a time. Feel free to explore Sephora's best sellers to find products that resonate with you. Best of luck on your skincare journey!

Let me know if you have any questions or need more personalized recommendations. 😊 i suggest also some product that you can make it in your house and I give more tips

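Smoking Gun Evidence: Notice how closely the phrases line up: the comment's "simple yet effective routine" and "Exfoliation (2-3 times a week)" mirror ChatGPT's "simple yet effective" and "Exfoliation (2-3x Per Week)". With the same wording, the same heading pattern, and the same frequency, coincidence is an unlikely explanation.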

The Smoking Gun

The Reddit comment contained not just the same advice, but closely matching phrasing, a near-identical heading format, and the same frequency recommendation. This level of specificity strongly suggests that this comment, or text very much like it, was part of ChatGPT's training data.

Now comes the moment of truth. Using our detective technique of exact phrase searching, we found strong evidence connecting ChatGPT's response to Reddit content in its training data. The side-by-side comparison below shows how close the connection is: the similarities are specific enough that coincidence is hard to credit.

ChatGPT Response

• "simple yet effective": found in introduction and conclusion
• "Exfoliation (2-3x Per Week)": specific formatting and frequency
• Multiple bullet points structure
• Morning/evening routine format

Reddit Source Found

• "simple yet effective routine": exact phrase match in r/beauty
• "Exfoliation (2-3 times a week)": same frequency, near-identical formatting
• Same bullet structure
• Morning/evening routine breakdown

This side-by-side comparison makes the connection between ChatGPT's response and Reddit content hard to dismiss. The AI system does not appear to have generated this advice from scratch: it echoed specific phrases, formatting choices, and structural elements from a Reddit comment. When a phrase like "simple yet effective routine" and a heading like "Exfoliation (2-3 times a week)" show up in nearly the same form in both places, we are looking at strong evidence of how Reddit content flows into ChatGPT's knowledge base.
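The comparison can also be done mechanically. The sketch below, using Python's standard `difflib` module, finds the longest run of words that two texts share. The function names `tokenize` and `longest_shared_phrase` are made up for illustration, and the two input strings are the opening sentences quoted earlier from the ChatGPT response and the Reddit comment.

```python
import re
from difflib import SequenceMatcher


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def longest_shared_phrase(response: str, source: str) -> str:
    """Return the longest contiguous word sequence found in both texts."""
    a, b = tokenize(response), tokenize(source)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return " ".join(a[match.a:match.a + match.size])


chatgpt_line = ("The best skincare routine for men should be simple yet "
                "effective, focusing on cleansing, moisturizing, and sun "
                "protection.")
reddit_line = ("For men, a simple yet effective routine can make a "
               "noticeable difference in skin health.")

print(longest_shared_phrase(chatgpt_line, reddit_line))
# prints: simple yet effective
```

A shared run of three words is a lead rather than proof on its own; the case gets stronger as more independent overlaps (phrasing, headings, structure) turn up in the same source.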

Similar Discoveries by Other Researchers

This detective approach isn't unique to our investigation. Marketing professionals and SEO experts have been using similar techniques to reverse-engineer ChatGPT's sources.

The Go Fish Digital Case Study

Chris Long from Go Fish Digital discovered that ChatGPT was referencing his company in ways that didn't match their official marketing materials. By investigating unusual phrases like "merges technical SEO, digital PR, and ORM," he traced the source back to a specific directory listing.

"When ChatGPT mentioned we worked with 'Thryv' - a company I'd never heard of in 9 years - I knew something was up. A simple Google search for 'Go Fish Digital' + 'Thryv' led me to the exact directory article ChatGPT was using as a reference."

Beyond training data, recent discoveries show that ChatGPT also searches Reddit in real time when it needs current information.

๐ŸŒThe "site:reddit.com" Discovery

Andrei Baloleanu revealed that ChatGPT uses the classic "site:reddit.com" search operator when looking for current information. When he asked about "the best git client of June 2025," ChatGPT performed a search for "best git client windows June 2025 site:reddit.com."

Key Insight:

"There's nothing fancy about it, just the classic 'site:reddit.com' used by everyone. It's time to go and do some searches for your money keywords with 'site:reddit.com' at the end and see what comes up."

Actual search logs revealing ChatGPT's use of "site:reddit.com" for git client recommendations
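To run the same kind of check on your own money keywords, a small helper can append the operator and build a search link. This is a minimal sketch: the function names are hypothetical, and the Google search URL is simply the assumed place to run the query. It says nothing about how ChatGPT issues its searches internally.

```python
from urllib.parse import urlencode


def reddit_scoped_query(keyword: str) -> str:
    """Append the site:reddit.com operator to a keyword."""
    return f"{keyword} site:reddit.com"


def search_url(query: str, base: str = "https://www.google.com/search") -> str:
    """Build a search URL with the query safely URL-encoded."""
    return f"{base}?{urlencode({'q': query})}"


query = reddit_scoped_query("best git client")
url = search_url(query)
print(query)
print(url)
```

Opening the resulting link shows which Reddit threads rank for the keyword, and so which threads are plausible candidates for an AI system to pick up.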

The Official Reddit Partnership

These discoveries gain even more significance when we consider the official OpenAI-Reddit partnership announced in 2024. This partnership gives OpenAI access to Reddit's Data API, allowing real-time access to Reddit content.

Interestingly, Sam Altman, OpenAI's CEO, has historical connections to Reddit: he sat on Reddit's board and briefly served as its CEO in 2014. This relationship may help explain the close ties between ChatGPT and Reddit content.

Reddit Integration Timeline

1. Training Data: Reddit content in original training

2. Live Search: site:reddit.com queries

3. Official Partnership: Real-time API access

Implications for Content Creators and Marketers

Reddit Strategy

Actively participate in relevant Reddit communities with helpful, detailed responses that could become training data.

Directory Optimization

Ensure your business information is accurate across directories that might be referenced by AI systems.

Source Tracking

Use these detective techniques to identify and influence the sources that shape how AI systems present your brand.

The Future of AI Source Investigation

As AI systems become more sophisticated, understanding their sources becomes crucial for anyone looking to influence how their brand or content appears in AI-generated responses. These detective techniques will only become more valuable as AI adoption continues to grow across search and content discovery.

Try It Yourself: A Step-by-Step Guide

1. Ask ChatGPT About Your Industry

Start with a question relevant to your business or expertise area. Look for specific advice, recommendations, or unusual phrasings.

Pro Tip: Need inspiration? Visit getaiso.com to access real questions people ask ChatGPT (collected through voluntary opt-in) to see what's trending in your industry.

2. Identify Unique Phrases

Highlight phrases that seem distinctive, specific, or unusually formatted. These are your best leads for tracing sources.

3. Search with Exact Quotes

Use Google with quotation marks around the unique phrases. Try variations and combinations to find potential sources.

4. Verify and Document

Compare the found sources with ChatGPT's response. Document strong matches for future reference and strategy development.
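For the documentation step, even a tiny log keeps the evidence organized. The sketch below is one possible approach, not a prescribed workflow: it records which phrases appear verbatim (ignoring case) in which candidate sources and writes the result as CSV. The names `match_log` and `to_csv` and the source label `reddit_comment` are hypothetical; the sample text is taken from the Reddit comment quoted earlier.

```python
import csv
import io


def match_log(phrases: list[str], sources: dict[str, str]) -> list[tuple[str, str]]:
    """Return (phrase, source name) pairs for verbatim, case-insensitive hits."""
    rows = []
    for phrase in phrases:
        for name, text in sources.items():
            if phrase.lower() in text.lower():
                rows.append((phrase, name))
    return rows


def to_csv(rows: list[tuple[str, str]]) -> str:
    """Serialize the match log as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["phrase", "source"])
    writer.writerows(rows)
    return buffer.getvalue()


phrases = ["simple yet effective", "Exfoliation (2-3x Per Week)"]
sources = {
    # Hypothetical label for the Reddit comment from the case study.
    "reddit_comment": (
        "For men, a simple yet effective routine can make a noticeable "
        "difference in skin health. Exfoliation (2-3 times a week): "
        "Incorporate an exfoliating product"
    ),
}

rows = match_log(phrases, sources)
print(to_csv(rows))
```

Only the first phrase is logged here, because the exfoliation heading differs slightly between the two texts. Near matches like that one are worth recording by hand alongside the verbatim hits.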

Key Takeaways

  • ChatGPT's responses often contain traceable phrases from specific online sources
  • Reddit is a major source for both training data and live search results
  • Directory listings and forum comments can significantly influence AI responses about brands
  • Simple detective techniques can reveal the sources behind AI-generated content
  • Understanding these sources is crucial for AI optimization strategies

Investigation Summary Infographic

Complete visual summary of our detective investigation methodology and key findings about ChatGPT's sources