We have two independent ways to find leads: embedding similarity (content/structure match to Pete's top leads) and a GoDaddy HTML scan (text match on GoDaddy signatures in the page HTML). Here's how the 156,142 scanned sites split across the two methods.
Feb 2026 · Pete List v2 · 23-site centroid · 156,142 HTML files scanned in R2 (Jan + Feb)
[Chart: 156,142 scanned sites by method. Legend: Embedding similarity · GoDaddy HTML scan · Found by both · Neither]
GoDaddy count is from a full scan of all 156,142 HTML files in R2 (via Cloudflare Worker). 17,420 files contain GoDaddy signatures (11.2%). The overlap (~3,166) is estimated from tier-level GoDaddy rates measured by sampling.
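The figures above can be cross-checked with a few lines of arithmetic. This is only a sketch using the numbers quoted in this report; the variable names are illustrative, and the overlap is the sampled estimate, not an exact count.

```python
# Figures quoted in this report; the overlap is an estimate from sampling.
total_scanned = 156_142      # HTML files scanned in R2
godaddy_hits = 17_420        # files containing a GoDaddy signature
estimated_overlap = 3_166    # GoDaddy sites that also passed similarity (estimated)

# Share of scanned files carrying a GoDaddy signature.
signature_rate = godaddy_hits / total_scanned

# GoDaddy sites found by the HTML scan but not by the embedding.
html_scan_only = godaddy_hits - estimated_overlap

print(f"{signature_rate:.1%}")  # 11.2%
print(html_scan_only)           # 14254
```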
What each method found
Embedding only: 12,579
Only findable by embedding
Non-GoDaddy sites with genuine content similarity to Pete's top leads. Built on Wix, WordPress, Squarespace, custom platforms. No HTML signature to scan for — only the embedding catches them.
HTML scan only: 14,254
GoDaddy sites that didn't pass the similarity filters (too few words, no mobile, low similarity score, etc.). The embedding missed them, but a GoDaddy signature scan on the raw HTML surfaces them.
Found by both: ~3,166
GoDaddy sites that also ranked as similar. Mostly Elite/Strong tier (~2,470), where template matching inflated scores, plus ~696 in lower tiers. Would be found by either method alone.
Neither
Not GoDaddy and not similar enough to Pete's leads. May still be valid businesses but don't match the current ICP signal.
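The four buckets described above amount to a set partition. A minimal sketch, assuming each method yields a set of site identifiers; the function name and the `*.example` domains are hypothetical, and in this report the overlap was estimated from sampled tier rates, not computed by an exact join like this:

```python
def split_by_method(all_sites, embedding_hits, godaddy_hits):
    """Partition scanned sites into the four buckets used in this report."""
    return {
        "embedding_only": embedding_hits - godaddy_hits,
        "html_scan_only": godaddy_hits - embedding_hits,
        "both": embedding_hits & godaddy_hits,
        "neither": all_sites - (embedding_hits | godaddy_hits),
    }


# Toy data for illustration only; these are not real leads.
all_sites = {"a.example", "b.example", "c.example", "d.example", "e.example"}
embedding_hits = {"a.example", "b.example"}   # passed the similarity filters
godaddy_hits = {"b.example", "c.example"}     # GoDaddy signature found in HTML

buckets = split_by_method(all_sites, embedding_hits, godaddy_hits)
for name, sites in buckets.items():
    print(name, sorted(sites))
```

Because the buckets are disjoint and cover every scanned site, their sizes always sum to the total scanned, which is a useful sanity check on any reported split.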
Actionable leads by method
[Bar chart: Embedding only 12,579 · Both ~3.2k · HTML scan only 14,254]
Total unique actionable sites: ~30k
12,579 + 3,166 + 14,254 = 29,999, deduplicated across both methods
The two methods are mostly complementary. Only ~3.2k sites overlap. The embedding uniquely surfaces 12,579 non-GoDaddy content matches that no HTML scan would find. The HTML scan uniquely surfaces 14,254 GoDaddy sites that fell below similarity thresholds. Together: ~30k actionable leads from 156k scanned.
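As a quick check on the totals, the sketch below adds up the three actionable buckets and derives the remainder. The inputs are the figures quoted in this report; the "neither" count and the yield are derived here and inherit the uncertainty of the estimated overlap.

```python
# Bucket sizes quoted in this report (the overlap is an estimate).
embedding_only = 12_579
both = 3_166
html_scan_only = 14_254
total_scanned = 156_142

# Each site falls in exactly one bucket, so a plain sum is already deduplicated.
actionable = embedding_only + both + html_scan_only
neither = total_scanned - actionable

print(actionable)                            # 29999
print(neither)                               # 126143
print(f"{actionable / total_scanned:.1%}")   # 19.2%
```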