Marketing — Lead Discovery

Find prospects who already use a competing support tool, score them for fit, and stage them for outreach. Free data sourcing with no third-party subscription required.

The Lead Discovery pipeline turns "companies running competitor X's chat widget right now" into ranked, contact-attached prospects ready for outreach. It's the first step in the Marketing → Outreach workflow.

The pipeline is tenant-driven. Each tenant's Ideal Customer Profile (ICP) — set in Settings → Company Profile — drives both the BigQuery filter (which competitors' users to surface) and the Haiku scoring rubric (what counts as a fit, what disqualifies). Run it for any tenant whose ICP is filled in: npx tsx --env-file=.env.local scripts/leadgen/discover.ts --tenant <slug>. The --vendors flag still works as a manual override when you want to harvest only a subset of the tenant's competitor list.

What it does

Given a list of vendors you're positioned against (Zendesk, Qualified, Intercom by default for Hydra), the pipeline:

  1. Queries HTTP Archive on Google BigQuery for every site that has the vendor's widget detected in the latest monthly crawl. Free; counts against your project's 1 TB/month BigQuery free tier (~75 runs available before any cost).
  2. Scrapes each homepage via Firecrawl for the company description, target market, and any visible contact info.
  3. Scores each company 1–10 for fit with Hydra's ICP using Anthropic Haiku. Companies below the floor (default 4) are dropped before contact discovery.
  4. Extracts a contact email from the homepage when one is publicly visible. v1 looks at mailto: links and plain emails on the homepage; future versions will also crawl /about and /team pages.
  5. Verifies the email via SMTP RCPT TO probe (best-effort; many ISPs block port 25, in which case the contact is recorded as unverified rather than dropped).
  6. Generates a personalized opening line referencing something specific from the prospect's site, ready to drop into a campaign.
  7. Persists everything as a prospect record, along with a comparison URL linking to the matching /compare/hydra-vs-X page on hydra-help.com.
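Step 4's homepage email extraction can be sketched in a few lines. This is an illustrative helper, not the shipped code; the function name is hypothetical, and only the "mailto: links and plain emails on the homepage" behavior comes from the docs.

```typescript
// Hypothetical sketch of step 4: pull a contact email out of homepage HTML.
// mailto: links win (they're deliberate); plain-text emails are the fallback.
function extractContactEmail(html: string): string | null {
  // A mailto: link — stop at quotes, query strings, or whitespace.
  const mailto = html.match(/mailto:([^"'?\s>]+)/i);
  if (mailto) return mailto[1].toLowerCase();

  // Otherwise, take the first plain-text email visible in the markup.
  const plain = html.match(/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/i);
  return plain ? plain[0].toLowerCase() : null;
}
```

Future versions that crawl /about and /team pages would feed each fetched page through the same extraction.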

What gets stored

For each prospect, Hydra records:

  • Identity: domain, company name, industry, estimated employee count
  • Vendor signal: which competitor's widget was detected (drives the comparison-page CTA)
  • Fit: 1–10 score from Haiku
  • Contact: email + best-effort verification status
  • Outreach material: personalized opening line, comparison URL
  • Provenance: which discovery run produced this prospect + when
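Pictured as a record, the stored shape looks roughly like the following. Field names are illustrative assumptions, not the actual schema; only the groupings above come from the docs.

```typescript
// Illustrative prospect record — field names are assumptions, not the schema.
interface Prospect {
  // Identity
  domain: string;
  companyName: string;
  industry: string | null;
  estimatedEmployees: number | null;
  // Vendor signal (drives the comparison-page CTA)
  detectedVendor: string;
  // Fit
  fitScore: number; // 1-10 from Haiku
  // Contact
  contactEmail: string | null;
  emailVerified: 'verified' | 'unverified' | 'invalid';
  // Outreach material
  openingLine: string;
  comparisonUrl: string;
  // Provenance
  discoveryRunId: string;
  discoveredAt: string; // ISO timestamp
}

const example: Prospect = {
  domain: 'acme.example',
  companyName: 'Acme',
  industry: 'construction software',
  estimatedEmployees: 40,
  detectedVendor: 'intercom',
  fitScore: 8,
  contactEmail: 'hello@acme.example',
  emailVerified: 'unverified',
  openingLine: 'Noticed your field-crew scheduling product...',
  comparisonUrl: '/compare/hydra-vs-intercom',
  discoveryRunId: 'run-123',
  discoveredAt: new Date().toISOString(),
};
```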

Cold prospects are kept separate from your live CRM. They cross over only on real engagement (reply or click — see Engagement tracking and CRM promotion in Marketing — Outreach). This keeps your Leads and Contacts lists clean.

Running a discovery from the UI

Open Marketing → Outreach and scroll to the Lead Discovery panel. Click Start new run. The form has four fields:

  • Channel (required) — pick the email channel that this harvest is sourced for. Replies to campaigns built off this run will route into this channel's inbox. Only kind='email' channels with an email_address set show up in the dropdown. If the dropdown is empty, create an email channel first — discovery runs are channel-scoped and there's nowhere for replies to land otherwise.
  • Vendor filter — chip toggles for the eight tracked competitor platforms. Leave all unchecked to fall back to the competitors list on your tenant's ICP (Settings → Company Profile).
  • Minimum fit score — the slider sets the floor Haiku uses to drop low-fit candidates. Default 4 (permissive). Raise to 7+ for tighter filtering — most quality candidates land at 6-8.
  • Max candidates — the BigQuery LIMIT. Default 50, capped at 150 per run. Each candidate costs ~$0.015 (Firecrawl + Haiku), so 50 candidates ≈ $0.75 and 150 ≈ $2.25.

Click Start run and the page transitions to a progress view: a bar showing qualified-vs-max, three-tile counters (qualified / with-contact / cost), and a "polling every 4s" indicator. The harvest runs server-side in batches of 5 — you can leave the page and come back later. When the run completes, the page refreshes automatically and the new prospects appear in the list below.

Failed runs surface their error in red with a Try again button. If the failure is "recursion cap reached," the run hit the 30-batch (150-candidate) self-recurse cap; raise MAX_RECURSION in the worker route or split into two smaller runs.
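The cap arithmetic is simple enough to sketch. The constants mirror the values in the text (batches of 5, a 30-batch MAX_RECURSION cap); the real worker-route code may structure this differently.

```typescript
// Sketch of the batch/recursion-cap arithmetic described above.
const BATCH_SIZE = 5;      // candidates processed per server-side batch
const MAX_RECURSION = 30;  // self-recurse cap in the worker route

function batchesNeeded(maxCandidates: number): number {
  return Math.ceil(maxCandidates / BATCH_SIZE);
}

function hitsRecursionCap(maxCandidates: number): boolean {
  return batchesNeeded(maxCandidates) > MAX_RECURSION;
}
```

A 150-candidate run needs exactly 30 batches, so it sits right at the cap; anything larger requires raising MAX_RECURSION or splitting the run.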

Running a discovery harvest (terminal)

The terminal CLI is still available and supports a few advanced flags the UI doesn't expose (--rank-cap, --rank-floor, --snapshot-date, --skip-enrichment, --dry-run).

--channel is required: every harvest is bound to an email channel so reply routing has a destination once those prospects become campaign recipients. The channel must be kind='email' with an email_address set. Pass either the channel's name or its UUID.

# Top-3 priority vendors, 100-row sample
npx tsx --env-file=.env.local scripts/leadgen/discover.ts \
  --tenant hydra --channel "outbound-sales" \
  --vendors zendesk,qualified,intercom --limit 100

# Top-1M sites only, Intercom only, dry run (no DB writes)
npx tsx --env-file=.env.local scripts/leadgen/discover.ts \
  --tenant hydra --channel "outbound-sales" \
  --vendors intercom --limit 1000 --rank-cap 1000000 --dry-run

# Just dump the domain list — no Firecrawl, no Haiku, no DB
npx tsx --env-file=.env.local scripts/leadgen/discover.ts \
  --tenant hydra --channel "outbound-sales" \
  --vendors qualified --limit 500 --skip-enrichment

The script is idempotent on (tenant_id, domain). Re-running with the same vendor filter UPSERTs in place — it won't reset a prospect that's already been emailed.
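The upsert behavior can be sketched as SQL. The table and column names here are assumptions; only the (tenant_id, domain) conflict key and the don't-reset-emailed-prospects behavior come from the docs.

```typescript
// Hypothetical sketch of the idempotent upsert described above.
// Outreach state (e.g. an emailed_at column) is deliberately absent from
// the SET list, so re-running a harvest never resets a prospect that has
// already been emailed.
const upsertProspectSql = `
  INSERT INTO prospects (tenant_id, domain, fit_score, detected_vendor)
  VALUES ($1, $2, $3, $4)
  ON CONFLICT (tenant_id, domain) DO UPDATE
    SET fit_score = EXCLUDED.fit_score,
        detected_vendor = EXCLUDED.detected_vendor
`;
```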

Cost picture

Source | Cost per 100 prospects | Cost per 1,000 prospects
BigQuery (HTTP Archive) | ~$0 (free tier) | ~$0 (free tier)
Firecrawl (homepage scrape) | ~100 credits | ~1,000 credits
Anthropic Haiku (ICP + personal line) | ~$0.10 | ~$1.00
SMTP verification | $0 | $0
Total | ~$0.10 + Firecrawl credits | ~$1.00 + Firecrawl credits

No third-party subscriptions are required; the pipeline is subscription-free by design. Apollo, Hunter.io, Anymail Finder, etc. are optional pay-per-use fallbacks intended only for highest-fit rows that fail the free SMTP probe. None of those fallbacks are wired in v1, so all v1 traffic flows through the free path.

How fit scores work

The Haiku classifier runs against the homepage scrape and returns an integer 1-10 plus a frozen-vocabulary flags array. The scoring rubric is default-to-fit: the BigQuery prefilter has already established that every domain runs a paid customer-support SaaS, which is itself strong evidence of B2B + paying-for-tools + has-a-support-function. Haiku's job is to find disqualifiers, not to re-prove fit.

Score | Meaning
9-10 | Clear bullseye. Mid-market B2B SaaS, multiple positive signals, no disqualifiers. Construction-tech bullseyes apply the +2 vertical bonus to land here.
7-8 | Clearly B2B SaaS. Most positive signals present, no hard disqualifier. Default sweet spot — most qualified rows live here.
5-6 | Probably B2B SaaS but the homepage is thin on signals or slightly ambiguous. Often these are real targets that just haven't told their story well.
3-4 | Ambiguous. Could be services-with-product attached, could be a niche industry that's not in our wheelhouse. Borderline.
1-2 | Hard disqualifier present (consumer brand, agency-for-clients, healthcare, government, mega-enterprise, pre-product).

Default fit-score floor: 4. Prospects scoring ≥4 are persisted for the campaign builder. Lower the floor to ≥3 to be more permissive; raise it to ≥6 to tighten.
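Applying the floor is a straightforward filter over the scored candidates. This is an illustrative sketch, not the pipeline's actual code.

```typescript
// Sketch: drop candidates scoring below the fit floor before contact
// discovery. Scores are the 1-10 integers returned by the classifier.
function applyFloor(scores: Map<string, number>, floor = 4): string[] {
  return [...scores.entries()]
    .filter(([, score]) => score >= floor)
    .map(([domain]) => domain);
}
```

With the default floor of 4, a batch scored {a.com: 8, b.com: 3, c.com: 4} keeps a.com and c.com; raising the floor to 6 keeps only a.com.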

What the tool can't see

The classifier only reads the homepage. It does NOT have visibility into:

  • Customer count
  • Funding stage
  • Employee count (unless displayed on the homepage)
  • Support team size

Those signals live in Crunchbase, LinkedIn, and Apollo. They're checked downstream by the campaign-builder review pass or by manual review. Do not lower a row's score because the homepage didn't show these — that's the bug the prompt was rewritten to fix.

Expect occasional surprises:

  • A 9 turns out to be too late-stage when you check LinkedIn → you de-select before sending
  • A 5 turns out to be a perfect fit when you check Crunchbase → you keep it

That's expected, not a bug.

Vertical bonus (construction-tech)

Hydra's first GTM wedge is construction-tech. When Haiku flags a homepage as construction_vertical, the runner adds +2 to the fit score (capped at 10). Configure via env var LEADGEN_VERTICAL_BONUS=construction:2 — set in .env.local to enable. Other verticals can be added the same way (e.g. swap to legal:2 if a future SOC 2 cert opens that segment).
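The env-var format and the cap can be sketched as follows. The parsing details are assumptions; only the construction:2 format, the construction_vertical flag, and the cap at 10 come from the text.

```typescript
// Sketch of LEADGEN_VERTICAL_BONUS parsing ("construction:2,legal:2" style)
// and the capped bonus application described above.
function parseVerticalBonus(envValue: string | undefined): Map<string, number> {
  const bonuses = new Map<string, number>();
  if (!envValue) return bonuses;
  for (const pair of envValue.split(',')) {
    const [vertical, amount] = pair.split(':');
    if (vertical && amount) bonuses.set(vertical.trim(), Number(amount));
  }
  return bonuses;
}

function applyVerticalBonus(
  score: number,
  flags: string[],               // frozen-vocabulary flags from the classifier
  bonuses: Map<string, number>,
): number {
  let total = score;
  for (const [vertical, bonus] of bonuses) {
    // e.g. the "construction" bonus keys off the construction_vertical flag
    if (flags.includes(`${vertical}_vertical`)) total += bonus;
  }
  return Math.min(10, total);    // fit scores are capped at 10
}
```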

Caveats and known limits

  • Vendor-detection ≠ support use case. Many companies run Intercom / Zendesk on their marketing site as a sales-chat lead capture rather than for actual customer support. The classifier doesn't try to distinguish — those companies are still B2B and still paying for SaaS, so they pass through to enrichment.
  • SMTP verification is best-effort. Port 25 outbound is blocked from many residential ISPs (Comcast, Spectrum) and most cloud platforms (Vercel, AWS). When the probe times out or can't connect, the contact is recorded as unverified rather than dropped — the email pattern is still useful, the verification just couldn't run from your network.
  • Catch-all domains return false positives. Some company domains accept mail to any address; SMTP verification can't distinguish a real mailbox from a catch-all. v1 doesn't try; future improvement is a "deliberately bogus" probe pair.
  • HTTP Archive is monthly. The latest snapshot lags ~30 days behind real time. A vendor swap that happened last week won't be reflected until next month's crawl.
  • Shared-platform domains are excluded. WordPress.com, Substack, Quora, and ~40 other shared platforms run support widgets site-wide; their users would otherwise show up as false-positive "Intercom customers." Hydra ships an internal exclusion list covering all of them.
  • Top-1000 global ranks excluded by default. The CrUX top-1000 is dominated by enterprise giants (Roblox, MercadoLibre, etc.) that Haiku will reject anyway. Default --rank-floor 1000 drops them at SQL time to save Firecrawl + Haiku spend. Override with --rank-floor 0 if you want them included.
  • Greenfield prospects (no detected vendor) aren't sourced. v1 only surfaces companies that have a competing widget installed. Greenfield-target campaigns require a different sourcing strategy (e.g. industry vertical lists from public directories) — out of scope for now.
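The SMTP caveats above reduce to a small decision rule. The status names in this sketch are hypothetical; the one behavior taken from the text is that connection failures map to unverified, never to dropping the contact.

```typescript
// Sketch of mapping an RCPT TO probe outcome to a verification status.
// A null code means the probe never completed (port 25 blocked, timeout).
type VerifyStatus = 'verified' | 'invalid' | 'unverified';

function classifyRcptResult(smtpCode: number | null): VerifyStatus {
  if (smtpCode === null) return 'unverified';                // probe couldn't run
  if (smtpCode >= 200 && smtpCode < 300) return 'verified';  // 250 OK to RCPT TO
  if (smtpCode >= 500) return 'invalid';                     // 550 no such mailbox
  return 'unverified';                                       // 4xx greylisting etc.
}
```

Note the catch-all caveat still applies: a catch-all domain answers 250 for any address, so "verified" here means "the server accepted it", not "the mailbox exists".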

Coming soon

  • Cross-campaign analytics (per-campaign cost + delivery funnel).