fronted/scanner: client-side CDN front discovery (draft) #488

Draft
myleshorton wants to merge 8 commits into main from fisk/fronted-scanner

Draft — companion to lantern-box#TBD (meek outbound).

Summary

Adds a client-side probe-based scanner that turns the existing fronted.yaml.gz masquerades — plus opportunistic CloudFront-range samples and Akamai hostname-regex draws — into a ranked list of (IP, outer SNI, inner Host) tuples that work from this client's network position right now.

Why this can't be a server-curated list: censorship in IR moves faster than our config-push cadence, and per Samim Mirhosseini (developer behind patterniha/MITM-DomainFronting, in this Slack thread) the working fronts are "different for each person depending on what ISP they use, their location and time of day". The right discovery loop is per-client, not per-deploy.

Architecture

Layer 1 — client-side probe (this PR)

Probe(ctx, Candidate, Options) Result performs the full check that any one front actually works for this client:

  1. TCP connect to Candidate.IPAddress:443 via the supplied Dialer
  2. uTLS handshake with ServerName = Candidate.SNI (empty = no SNI sent, Akamai style; non-empty = sent verbatim, CloudFront style)
  3. Cert chain verified against Candidate.VerifyHostname
  4. HTTPS GET to Candidate.TestURL with Host: Candidate.InnerHost
  5. Returns OK only on 2xx — TLS-only success would confirm reachability without confirming the inner Host routes to the right backend
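
The five steps above can be sketched as follows. This is a minimal illustration, not the code in scanner.go: it uses the standard library's crypto/tls in place of uTLS (so the ClientHello fingerprint differs), and the Candidate layout, the probe and is2xx names, and the timeouts are all assumptions. The candidate in main holds placeholder values and is expected to fail.

```go
package main

import (
	"context"
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net"
	"net/http"
	"time"
)

// Candidate carries the fields named in the steps above. The real struct
// may differ; this layout is an assumption.
type Candidate struct {
	IPAddress, SNI, VerifyHostname, InnerHost, TestURL string
}

// is2xx reports whether an HTTP status counts as a working front.
func is2xx(status int) bool { return status >= 200 && status < 300 }

// probe runs steps 1-5 for one candidate and returns nil only on a 2xx.
func probe(ctx context.Context, c Candidate) error {
	// Step 1: TCP connect to the edge IP.
	d := net.Dialer{Timeout: 5 * time.Second}
	raw, err := d.DialContext(ctx, "tcp", net.JoinHostPort(c.IPAddress, "443"))
	if err != nil {
		return fmt.Errorf("tcp: %w", err)
	}
	// Steps 2-3: handshake with the outer SNI (empty means no SNI extension)
	// and verify the chain against VerifyHostname instead of the SNI.
	cfg := &tls.Config{
		ServerName:         c.SNI,
		InsecureSkipVerify: true, // default check is replaced by VerifyConnection
		VerifyConnection: func(cs tls.ConnectionState) error {
			if len(cs.PeerCertificates) == 0 {
				return errors.New("no peer certificate")
			}
			opts := x509.VerifyOptions{
				DNSName:       c.VerifyHostname,
				Intermediates: x509.NewCertPool(),
			}
			for _, cert := range cs.PeerCertificates[1:] {
				opts.Intermediates.AddCert(cert)
			}
			_, err := cs.PeerCertificates[0].Verify(opts)
			return err
		},
	}
	conn := tls.Client(raw, cfg)
	if err := conn.HandshakeContext(ctx); err != nil {
		raw.Close()
		return fmt.Errorf("tls: %w", err)
	}
	// Step 4: GET TestURL (must be https) over the connection opened above,
	// overriding the Host header with the inner host.
	tr := &http.Transport{
		DialTLSContext: func(context.Context, string, string) (net.Conn, error) {
			return conn, nil
		},
	}
	defer tr.CloseIdleConnections()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, c.TestURL, nil)
	if err != nil {
		conn.Close()
		return err
	}
	req.Host = c.InnerHost
	resp, err := tr.RoundTrip(req)
	if err != nil {
		conn.Close()
		return fmt.Errorf("http: %w", err)
	}
	defer resp.Body.Close()
	// Step 5: only a 2xx proves the inner Host reached a backend.
	if !is2xx(resp.StatusCode) {
		return fmt.Errorf("http status %d", resp.StatusCode)
	}
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	// Placeholder candidate; substitute a real edge IP and hosts to try it.
	c := Candidate{
		IPAddress:      "127.0.0.1",
		VerifyHostname: "a248.e.akamai.net",
		InnerHost:      "example.com",
		TestURL:        "https://example.com/",
	}
	fmt.Println("probe error:", probe(ctx, c))
}
```

Setting InsecureSkipVerify together with VerifyConnection is the standard-library way to verify a chain against a name other than the SNI, which is what lets the SNI stay empty while VerifyHostname is still enforced.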

Scan(ctx, []Candidate, Options) []Result runs Probe concurrently with configurable budget.
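
One way to bound concurrency is a counting semaphore built from a buffered channel. The sketch below is an assumption about the shape, not the PR's Scan: it takes a candidate count and a probe callback instead of ctx, []Candidate and Options, and the names scan, budget and countOK are illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

// Result pairs a candidate index with its probe outcome.
type Result struct {
	Index int
	OK    bool
}

// scan probes n candidates with at most budget probes in flight at once.
func scan(n, budget int, probe func(i int) bool) []Result {
	if budget < 1 {
		budget = 1
	}
	results := make([]Result, n)
	sem := make(chan struct{}, budget) // counting semaphore
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		sem <- struct{}{} // blocks while budget probes are already running
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }()
			// Each goroutine writes only its own slot, so no lock is needed.
			results[i] = Result{Index: i, OK: probe(i)}
		}(i)
	}
	wg.Wait()
	return results
}

// countOK returns how many results succeeded.
func countOK(rs []Result) int {
	n := 0
	for _, r := range rs {
		if r.OK {
			n++
		}
	}
	return n
}

func main() {
	// Fake probe: even-numbered candidates "work".
	rs := scan(10, 3, func(i int) bool { return i%2 == 0 })
	fmt.Println("working:", countOK(rs), "of", len(rs))
}
```

Results keep candidate order here; ranking by latency would be a separate sort over the successful entries.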

Layer 2 — candidate generation

Three feeders into the candidate pool, each suited to a different CDN's edge model, plus a hostname generator that supplies the Akamai feeder:

  • CandidatesFromConfig(*domainfront.Config) flattens the existing fronted.yaml.gz masquerades. Pre-validated (IP, SNI) pairs; this is the primary input.
  • CloudFrontCandidates(n, snis, ...) for discovering CloudFront edges beyond the curated list. Embedded snapshot of AWS's 204 CloudFront IPv4 prefixes (cloudfront_prefixes.txt); weighted random sampling pairs IPs with caller-supplied outer SNIs. Expected hit rate is partial — each CloudFront edge serves a subset of distributions per POP, so the probe filters mismatches. Acceptable for discovery.
  • AkamaiCandidates(ctx, hostnames, SystemResolver{}, ...) for discovering Akamai edges via the OS/ISP resolver. Critical: this is the correct path even in IR — the ISP returns real Akamai IPs (Akamai isn't blocked, hosts too much Iranian critical infra) and those IPs are geographically near the client's network. DoH endpoints themselves are blocked in IR.
  • GenerateAkamaiHostnames(n) produces draws from a([1-9]|1[0-9])([0-9]{2})\.(dsc)?(b|d|g|g2|na|r|w7)\.akamai\.net, the same regex pattern shipped in Psiphon's server entries and adopted by MahsaNG / Shir-o-Khorshid. The pattern as written spans 26,600 hostnames (1,900 numeric prefixes × 2 for the optional dsc × 7 suffixes), all of which resolve through the same Akamai general edge property.
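
Drawing hostnames from that regex can be sketched by sampling each group independently. The function names, the seed parameter and the retry cap below are assumptions for illustration; only the pattern itself comes from the description above.

```go
package main

import (
	"fmt"
	"math/rand"
	"regexp"
	"strings"
)

// akamaiPattern is the regex from the description, anchored at both ends.
var akamaiPattern = regexp.MustCompile(
	`^a([1-9]|1[0-9])([0-9]{2})\.(dsc)?(b|d|g|g2|na|r|w7)\.akamai\.net$`)

var suffixes = []string{"b", "d", "g", "g2", "na", "r", "w7"}

// generateHostname draws one hostname by sampling each regex group.
func generateHostname(rng *rand.Rand) string {
	var sb strings.Builder
	// First group is 1..19, second group is exactly two digits.
	fmt.Fprintf(&sb, "a%d%02d.", 1+rng.Intn(19), rng.Intn(100))
	if rng.Intn(2) == 1 {
		sb.WriteString("dsc") // optional (dsc)? group
	}
	sb.WriteString(suffixes[rng.Intn(len(suffixes))])
	sb.WriteString(".akamai.net")
	return sb.String()
}

// generateHostnames returns up to n distinct hostnames. The attempt cap
// keeps the loop finite if n exceeds what the pattern can produce.
func generateHostnames(n int, seed int64) []string {
	rng := rand.New(rand.NewSource(seed))
	seen := make(map[string]bool, n)
	out := make([]string, 0, n)
	for attempts := 0; len(out) < n && attempts < 10*n; attempts++ {
		h := generateHostname(rng)
		if !seen[h] {
			seen[h] = true
			out = append(out, h)
		}
	}
	return out
}

// matchesPattern reports whether h is a member of the regex space.
func matchesPattern(h string) bool { return akamaiPattern.MatchString(h) }

func main() {
	for _, h := range generateHostnames(5, 1) {
		fmt.Println(h, matchesPattern(h))
	}
}
```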

For Akamai, VerifyHostname is always set to the canonical a248.e.akamai.net regardless of which regex hostname was used to discover the IP — the regex hostnames aren't in the cert's SAN list, but the edge's default cert always validates against a248.e.akamai.net. This was a non-obvious bug in the initial draft; live-network testing exposed it (cert-mismatch failures with regex-generated VerifyHostnames).

Sequence

sequenceDiagram
    participant App as radiance client
    participant Scn as fronted/scanner
    participant SYS as System Resolver (ISP)
    participant CDN as CDN edge (Akamai/CloudFront)

    Note over App: needs working front
    App->>Scn: Scan(candidates)
    par per candidate
        Scn->>SYS: LookupHost(a248.e.akamai.net) [Akamai feeder]
        SYS-->>Scn: real edge IPs
    end
    loop concurrent probes
        Scn->>CDN: TCP + uTLS(SNI=⟂ or masquerade)
        Scn->>CDN: GET TestURL with Host: api.iantem.io
        CDN-->>Scn: 200 OK / 403 / TLS-mismatch
    end
    Scn-->>App: RankWorking() — sorted by latency

Test coverage

22 unit tests, all green:

  • Probe success / TCP fail / TLS hostname mismatch / HTTP 500 → not OK
  • Scan concurrency + RankWorking latency ordering
  • CandidatesFromConfig flattening
  • CloudFrontPrefixes weighted sampling
  • Akamai regex output matches the pattern
  • Front candidate dedup, partial DNS failure handling

Plus opt-in (SCANNER_INTEGRATION=1) live-network tests:

  • TestLive_AkamaiSystemResolver: ~100% hit rate (16/16 in latest run; IPs spanning 3 different POP clusters)
  • TestLive_CloudFrontRandomIPs: diagnostic only — hit rate is partial as expected
  • TestLive_CloudFrontKnownMasquerades: diagnostic — confirms how fronted.yaml.gz stale entries filter out

What's NOT in this PR

  • Consumer wiring: no existing radiance code calls the scanner yet. This PR delivers the primitive; integration with kindling/domainfront (refresh its working-pool from scanner output) or the lantern-box meek outbound (feed scanner-discovered fronts as Fronts config) happens separately.
  • Bandit aggregation across users: covered by getlantern/engineering#3525. This client-side scanner is the authoritative source per Samim's design; the server-side aggregation is an optional accelerator on top.
  • Meek transport using these fronts: see lantern-box's meek outbound PR (sibling).

Reference

🤖 Generated with Claude Code

Adds a probe-based scanner that turns the existing fronted.yaml.gz
masquerades — plus opportunistic CloudFront-range samples and Akamai
hostname-regex draws — into a ranked list of (IP, outer SNI, inner
Host) tuples that work from the client's network position.

Why this exists at all: censorship in IR moves fast enough that a
config push isn't a tight enough loop, and the working fronts are
per-(ISP, geography, time-of-day) per Samim Mirhosseini. The scanner
runs client-side and reports per-client truth.

Pieces:
- scanner.go: Candidate / Result / Probe / Scan / RankWorking. Probe
  does TCP + uTLS handshake + HTTPS GET to TestURL with the inner
  Host header. Only OK on a 2xx.
- candidates.go: CandidatesFromConfig flattens domainfront.Config
  into the primary probe pool. SNIsForProvider extracts the
  masquerade-domain pool for use with CloudFrontCandidates.
- cloudfront.go: 204 CloudFront IPv4 prefixes embedded; weighted
  random sampling pairs IPs with caller-supplied outer SNIs.
- akamai.go: SystemResolver (OS/ISP resolver — the ISP is the right
  source in IR). Akamai candidates leave SNI empty matching
  fronted.yaml.gz and verify against AkamaiCertHostname for every
  entry. GenerateAkamaiHostnames produces the Psiphon/MahsaNG regex
  pattern.
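
Weighted random sampling over a prefix list can be sketched as picking one address uniformly across the combined address space, so that larger prefixes are drawn proportionally more often. The helper names are assumptions, and the CIDRs in main are reserved test ranges used as placeholders, not real CloudFront prefixes.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/rand"
	"net/netip"
)

// size returns the number of addresses in an IPv4 prefix (0 for non-IPv4).
func size(p netip.Prefix) uint64 {
	if !p.Addr().Is4() {
		return 0
	}
	return uint64(1) << uint(32-p.Bits())
}

// sampleIP picks one address uniformly over all prefixes combined, which
// weights each prefix by its size.
func sampleIP(rng *rand.Rand, prefixes []netip.Prefix) (netip.Addr, bool) {
	var total uint64
	for _, p := range prefixes {
		total += size(p)
	}
	if total == 0 {
		return netip.Addr{}, false
	}
	n := uint64(rng.Int63n(int64(total)))
	for _, p := range prefixes {
		s := size(p)
		if n < s {
			base := binary.BigEndian.Uint32(p.Masked().Addr().AsSlice())
			var b [4]byte
			binary.BigEndian.PutUint32(b[:], base+uint32(n))
			return netip.AddrFrom4(b), true
		}
		n -= s
	}
	return netip.Addr{}, false
}

func parse(cidrs []string) []netip.Prefix {
	out := make([]netip.Prefix, 0, len(cidrs))
	for _, c := range cidrs {
		out = append(out, netip.MustParsePrefix(c))
	}
	return out
}

// sampleMany draws n addresses from the given CIDRs as strings.
func sampleMany(seed int64, n int, cidrs ...string) []string {
	rng := rand.New(rand.NewSource(seed))
	prefixes := parse(cidrs)
	out := make([]string, 0, n)
	for i := 0; i < n; i++ {
		if ip, ok := sampleIP(rng, prefixes); ok {
			out = append(out, ip.String())
		}
	}
	return out
}

// contained reports whether ip falls inside any of the CIDRs.
func contained(ip string, cidrs ...string) bool {
	addr := netip.MustParseAddr(ip)
	for _, p := range parse(cidrs) {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

func main() {
	// Placeholder ranges: a /15 is drawn about 512 times as often as a /24.
	for _, ip := range sampleMany(1, 5, "198.18.0.0/15", "192.0.2.0/24") {
		fmt.Println(ip)
	}
}
```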

22 unit tests, plus opt-in (SCANNER_INTEGRATION=1) live-network
tests. Akamai integration: ~100% hit rate against the canonical
edge hostname.

Adds the layer on top of the probe primitives: a Service that runs
scans on a schedule, persists working fronts to disk, exposes a
round-robin Pick API for consumers, and re-scans when consumers
report failures.

Lifecycle:
- NewService(cfg) loads any prior cache (filtered by CacheTTL so
  stale entries don't seed the live pool with already-blocked IPs)
- Start(ctx) runs the periodic refresh loop until ctx is canceled
  or Close is called
- Working() returns the current ranked list; Pick() returns the
  next one round-robin so all working fronts get traffic rather
  than every dial pinning to the lowest-latency entry
- ReportFailure(c) tracks per-front failures; after two failures
  within a refresh cycle the front is dropped, and if the working
  list falls below MinWorkingFronts a refresh is signaled
- Refresh() is a manual trigger
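
The Pick and ReportFailure behavior described above can be sketched as a small mutex-guarded pool. Everything here is an assumption for illustration: fronts are plain strings, the type and method names are local to the sketch, and the reset of failure counts at each refresh is omitted.

```go
package main

import (
	"fmt"
	"sync"
)

// pool holds the ranked working fronts and per-front failure counts.
type pool struct {
	mu         sync.Mutex
	working    []string
	next       int
	failures   map[string]int
	minWorking int
	refresh    chan struct{} // capacity 1 so repeated signals coalesce
}

func newPool(ranked []string, minWorking int) *pool {
	return &pool{
		working:    append([]string(nil), ranked...), // copy; we mutate it
		failures:   make(map[string]int),
		minWorking: minWorking,
		refresh:    make(chan struct{}, 1),
	}
}

// Pick returns the next front round-robin so traffic spreads over all
// working fronts instead of pinning to the lowest-latency one.
func (p *pool) Pick() (string, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.working) == 0 {
		return "", false
	}
	f := p.working[p.next%len(p.working)]
	p.next++
	return f, true
}

// ReportFailure drops a front on its second failure and signals a refresh
// when the pool falls below the low-water mark.
func (p *pool) ReportFailure(front string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.failures[front]++
	if p.failures[front] < 2 {
		return
	}
	delete(p.failures, front)
	for i, f := range p.working {
		if f == front {
			p.working = append(p.working[:i], p.working[i+1:]...)
			break
		}
	}
	if len(p.working) < p.minWorking {
		select {
		case p.refresh <- struct{}{}:
		default: // a refresh is already pending
		}
	}
}

func (p *pool) size() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.working)
}

// refreshPending reports whether a refresh signal is waiting.
func (p *pool) refreshPending() bool { return len(p.refresh) > 0 }

func main() {
	p := newPool([]string{"front-a", "front-b", "front-c"}, 3)
	for i := 0; i < 4; i++ {
		f, _ := p.Pick()
		fmt.Println("pick:", f)
	}
	p.ReportFailure("front-b")
	p.ReportFailure("front-b")
	fmt.Println("working:", p.size(), "refresh pending:", p.refreshPending())
}
```

A refresh loop would receive from the refresh channel alongside its periodic ticker.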

BuildPool composes candidates from the three feeders (known
masquerades from fronted.yaml.gz, regex-generated Akamai hostnames
resolved via SystemResolver, random CloudFront IPs paired with
masquerade SNIs). Sample sizes <= 0 disable a feeder.

Cache schema is versioned JSON written atomically (write tmp +
rename). Missing file is not an error — first-boot loads nothing
and proceeds to the first scan.
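
A minimal sketch of the write-tmp-then-rename pattern with a version field and TTL filtering on load. The JSON field names, the entry layout, and the choice to return an error on a version mismatch are assumptions; the PR's actual schema is not shown here.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"time"
)

const cacheVersion = 1 // hypothetical schema version

type cacheEntry struct {
	IP     string    `json:"ip"`
	LastOK time.Time `json:"last_ok"`
}

type cacheFile struct {
	Version int          `json:"version"`
	Entries []cacheEntry `json:"entries"`
}

// saveCache writes to a temp file in the target directory and renames it
// into place, so readers never observe a half-written file.
func saveCache(path string, entries []cacheEntry) error {
	data, err := json.Marshal(cacheFile{Version: cacheVersion, Entries: entries})
	if err != nil {
		return err
	}
	tmp, err := os.CreateTemp(filepath.Dir(path), ".fronts-*.tmp")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly after a successful rename
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// loadCache returns entries newer than ttl. A missing file is not an error.
func loadCache(path string, ttl time.Duration, now time.Time) ([]cacheEntry, error) {
	data, err := os.ReadFile(path)
	if errors.Is(err, fs.ErrNotExist) {
		return nil, nil
	}
	if err != nil {
		return nil, err
	}
	var cf cacheFile
	if err := json.Unmarshal(data, &cf); err != nil {
		return nil, err
	}
	if cf.Version != cacheVersion {
		return nil, fmt.Errorf("unsupported cache version %d", cf.Version)
	}
	var fresh []cacheEntry
	for _, e := range cf.Entries {
		if now.Sub(e.LastOK) <= ttl {
			fresh = append(fresh, e)
		}
	}
	return fresh, nil
}

// demo returns the fresh-entry count and whether a missing file loaded cleanly.
func demo() (int, bool, error) {
	dir, err := os.MkdirTemp("", "front-cache")
	if err != nil {
		return 0, false, err
	}
	defer os.RemoveAll(dir)
	path, now := filepath.Join(dir, "fronts.json"), time.Now()
	got, err := loadCache(path, 6*time.Hour, now)
	missingOK := err == nil && len(got) == 0
	entries := []cacheEntry{
		{IP: "192.0.2.10", LastOK: now.Add(-1 * time.Hour)},
		{IP: "192.0.2.11", LastOK: now.Add(-10 * time.Hour)},
	}
	if err := saveCache(path, entries); err != nil {
		return 0, missingOK, err
	}
	fresh, err := loadCache(path, 6*time.Hour, now)
	return len(fresh), missingOK, err
}

func main() {
	fmt.Println(demo())
}
```

The temp file is created in the same directory as the target because a rename is only atomic within one filesystem.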

Defaults: RefreshInterval 1h, CacheTTL 6h (matches Samim's
"time-of-day" observation that working fronts shift on roughly
that timescale), MinWorkingFronts 3.

Tests: 11 new (cache save/load/TTL/missing/version + service
round-robin/empty/failure-removal/low-water-signal/cache-restore/
no-config-is-error + BuildPool known-only and CloudFront paths).

Adds the consumer layer that converts the scanner.Service's working
list into []FrontSpec entries ready for the lantern-box meek
outbound's JSON configuration. Provider owns the Service lifecycle,
wires the bypass dialer so probes don't loop through the active VPN
TUN, and uses TrustedCAsPool from the loaded domainfront config so
cert validation matches production.

FrontSpec is a local mirror of lantern-box/option.FrontSpec — same
JSON shape, kept local to avoid version-coupling radiance to
lantern-box's release cadence (the meek option type lands in
lantern-box#265 and isn't published yet).

Service lifecycle fix: Close no longer hangs when Start was never
called. NewProvider returns an error for nil Config instead of
panicking inside TrustedCAsPool.

Adds a live-network timing benchmark (TestLive_TimeToFirstWorking,
gated on SCANNER_INTEGRATION=1) that loads the production
fronted.yaml.gz, builds a 70+ candidate pool, runs a full scan, and
reports time-to-first-working / total scan time / per-feeder hit
rate / per-probe latency p50/p90. On a sample run from a US dev
network:

- pool: 72 candidates (50 known + Akamai-DNS-resolved + 10 CloudFront-random)
- time to first working front: 205ms
- scan complete: 35/72 working in 8.79s
- akamai: 35/36 working (97%)
- cloudfront: 0/36 working (0%) — fronted.yaml.gz cloudfront testurl is stale
- per-probe latency: p50=218ms, p90=1.47s, min=142ms

Sub-second time-to-usability means a cold-boot client gets a working
front before the user notices. CloudFront's 0% is the known
POP-vs-distribution issue (#3525); production deployment with a
fresh, globally-served test URL would lift that.

Flips the default candidate pool composition so per-scan-fresh IPs
from the AWS CloudFront prefix list and DNS-resolved Akamai edges
are the primary discovery source, with the pre-resolved IPs in
fronted.yaml.gz reduced to opt-in via KnownSample > 0.

Why: the YAML's pre-resolved IPs are the same baked list every user
gets and don't move per (ISP, location, time-of-day). The raw-range
feeders self-heal as CDN edges rotate and produce per-user-fresh
candidates — matching Samim Mirhosseini's observation that the
working fronts vary across all three dimensions.

BuildPool semantic change: KnownSample <= 0 now skips the known
feeder entirely (previously it meant "use all known"). Callers
explicitly opt in by passing KnownSample > 0.

Provider defaults: KnownSample removed from defaults() (defaults to
0 → skip), CloudFrontSample=30, AkamaiSample=3 (4 hostnames after
adding canonical → typically ~8 unique IPs after DNS dedup).

Re-ran the live timing benchmark with new defaults from a US dev
network against the production fronted.yaml.gz:

- pool: 38 candidates (30 CloudFront-raw + 8 Akamai-DNS-resolved)
- time to first working front: 154ms (was 205ms)
- scan complete: 8/38 working in 10.7s
- akamai: 8/8 working (100%)
- cloudfront: 0/30 working (0%) — stale testurl in YAML
- per-probe latency: p50=244ms p90=292ms min=154ms

Tail latency tightened (p90 1.47s → 292ms) because the working pool
is now uniformly fresh rather than mixing pre-resolved IPs of
varying age. CloudFront's 0% is a fixable production deployment
issue (fresh globally-served distribution), not a discovery flaw.

Sub-200ms time-to-first-working means cold-boot clients have a
working front before the user notices.

http.Transport routes via DialTLSContext (our pre-opened fronted TLS
conn) only for https URLs. With an http:// TestURL the request fell
through to plain DNS + port 80, bypassing the front entirely — every
probe was effectively a direct-DNS plaintext request to the inner
hostname instead of a fronted request via the chosen CDN edge.

Akamai's TestURL in fronted.yaml.gz is https:// so its probes were
fine; CloudFront's is http:// so its probes were structurally broken.
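
One possible guard against this class of bug is to reject any TestURL that http.Transport would not route through DialTLSContext before the probe runs. The checkTestURL name is hypothetical, and this sketch is not necessarily how the fix in this commit is implemented.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// checkTestURL returns an error unless raw is an absolute https URL, the
// only kind http.Transport sends over a DialTLSContext connection.
func checkTestURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("parse TestURL: %w", err)
	}
	if u.Scheme != "https" {
		return fmt.Errorf("TestURL scheme %q would bypass the fronted TLS connection", u.Scheme)
	}
	if u.Host == "" {
		return errors.New("TestURL has no host")
	}
	return nil
}

func main() {
	// example.com is a placeholder host.
	for _, raw := range []string{"https://example.com/ping", "http://example.com/ping"} {
		fmt.Printf("%s -> %v\n", raw, checkTestURL(raw))
	}
}
```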

The fix surfaces a separate finding: even with probes routed
correctly, CloudFront returns HTTP 421 "Misdirected Request" for
every (random IP × masquerade SNI) pair AND for every pre-validated
pair in fronted.yaml.gz. AWS now strictly enforces SNI/Host match,
killing the cross-distribution Host header routing technique our
YAML attempts. CloudFront fronting via this scheme is not just stale
data — it's structurally disabled at the AWS layer.

Workable CloudFront fronting requires alternate-domain-names on the
same distribution (outer SNI and inner Host both belong to one
CloudFront distribution AWS owns the cert for), which is a different
deployment than fronted.yaml.gz uses today. Tracking as follow-up.

CloudFront fronting works when the client sends no SNI extension and
keeps the inner Host in the request. The TLS handshake completes with
CloudFront's default *.cloudfront.net cert (or a customer cert pinned
to that edge); CloudFront then routes by inner Host alone since no
SNI claims a different distribution.

Sending a non-empty SNI triggered HTTP 421 "Misdirected Request"
because CloudFront strictly enforces SNI/Host match — exactly the
behavior the earlier 0% hit rate exposed. Production's
fronted.yaml.gz CloudFront masquerades have always shipped with
sni: "" for the same reason; the bug was in my scanner's
CloudFrontCandidates setting SNI = masquerade-domain.

Two changes in CloudFrontCandidates:
- SNI: "" (was masquerade-domain) — sidesteps 421 enforcement.
- VerifyHostname: InnerHost (was masquerade-domain) — when no SNI,
  CloudFront serves either the *.cloudfront.net default cert (which
  wildcards the inner Host) or a customer-pinned cert. Verifying
  against InnerHost filters to the former, where cross-distribution
  Host routing actually reaches our backend. Verifying against the
  masquerade-domain rejected the wildcard cert and lost the working
  cases.
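
The effect of the VerifyHostname change can be reproduced offline with a throwaway self-signed certificate: a cert whose SAN is *.cloudfront.net validates for an inner Host under cloudfront.net but not for an unrelated masquerade domain. The helper names and both hostnames below are placeholders, and the sketch checks hostname matching only, not chain trust.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// selfSigned builds a throwaway certificate whose SAN list is dnsNames.
func selfSigned(dnsNames ...string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "sketch"},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
		DNSNames:     dnsNames,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// certCovers reports whether a cert with the given SANs validates for host.
func certCovers(host string, sans ...string) bool {
	cert, err := selfSigned(sans...)
	if err != nil {
		return false
	}
	return cert.VerifyHostname(host) == nil
}

func main() {
	inner := "d111111abcdef8.cloudfront.net" // placeholder inner Host
	masquerade := "masquerade.example.com"   // placeholder masquerade domain
	fmt.Println("verify against InnerHost: ", certCovers(inner, "*.cloudfront.net"))
	fmt.Println("verify against masquerade:", certCovers(masquerade, "*.cloudfront.net"))
}
```

The first check succeeds and the second fails, which matches the working cases that were lost when VerifyHostname was the masquerade domain.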

Live-network results after the fix:
- CloudFront random sampling: 1-3/30 working (3-8%) — was 0/30.
  The hit rate is structural (POP-vs-distribution coverage); each
  hit is an edge that genuinely routes to our distribution.
- Akamai: 100% unchanged.
- Time to first working front: 149ms.

Adds the radiance-side wiring that takes a FrontSpec list (from the
fronted/scanner Service via kindling/meek.Provider) and turns it into
a sing-box outbound the live tunnel can route through.

Two pieces:

1. kindling/meek.BuildOutbound(tag, url, fronts) constructs a
   sing-box O.Outbound with Type="meek" and a local
   MeekOutboundOptions struct whose JSON shape mirrors
   lantern-box/option.MeekOutboundOptions exactly. The local copy
   sidesteps the lantern-box version-coupling: lantern-box v0.0.82
   doesn't have the meek outbound type registered, so we can't
   import lbO.MeekOutboundOptions today. Once the lantern-box
   bump lands the local copy + MeekOutboundType constant can be
   replaced one-for-one with the upstream symbols.

   Returns ok=false when fronts is empty so callers skip injection
   when the scanner hasn't produced anything yet.

2. vpn.BoxOptions gains an optional MeekOutbound *O.Outbound field.
   buildOptions injects it into Outbounds and appends its Tag to
   the selector tags list immediately after mergeAndCollectTags
   (and before the auto/manual selector outbounds are built) so
   the meek outbound participates in routing alongside
   API-supplied ones. Nil = no-op, no behavior change for callers
   that don't set it.
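
The two pieces can be sketched with local stand-in types. None of the types below are the sing-box or lantern-box definitions: outbound, meekOptions, frontSpec and their fields are hypothetical, the JSON shape is deliberately not reproduced, and inject only mimics the append-after-merge ordering described above.

```go
package main

import "fmt"

// frontSpec, meekOptions and outbound are hypothetical stand-ins for
// FrontSpec, MeekOutboundOptions and sing-box's O.Outbound.
type frontSpec struct{ IP, SNI, Host string }

type meekOptions struct {
	URL    string
	Fronts []frontSpec
}

type outbound struct {
	Type    string
	Tag     string
	Options any
}

// buildOutbound returns ok=false when there are no fronts, so callers skip
// injection until the scanner has produced something.
func buildOutbound(tag, url string, fronts []frontSpec) (outbound, bool) {
	if len(fronts) == 0 {
		return outbound{}, false
	}
	return outbound{
		Type:    "meek",
		Tag:     tag,
		Options: meekOptions{URL: url, Fronts: fronts},
	}, true
}

// inject appends the meek outbound and its tag to the already merged
// lists. A nil outbound leaves both lists unchanged.
func inject(outbounds []outbound, tags []string, meek *outbound) ([]outbound, []string) {
	if meek == nil {
		return outbounds, tags
	}
	return append(outbounds, *meek), append(tags, meek.Tag)
}

func main() {
	// Placeholder front and URL.
	fronts := []frontSpec{{IP: "192.0.2.1", SNI: "", Host: "example.com"}}
	ob, ok := buildOutbound("meek-1", "https://meek.example/", fronts)
	if !ok {
		fmt.Println("no fronts yet, skipping injection")
		return
	}
	outs, tags := inject(nil, []string{"auto"}, &ob)
	fmt.Println(len(outs), tags)
}
```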

Until lantern-box's meek type is registered in radiance's pinned
version, setting MeekOutbound is a no-op end-to-end — libbox will
reject the unknown "meek" type at config unmarshal. The wiring is
ready; activation flips when (a) lantern-box bumps and (b) the
caller (whoever owns the VPNClient) populates MeekOutbound from a
meek.Provider's FrontSpecs.

Tests: 2 new in kindling/meek (BuildOutbound empty-fronts/shape),
2 new in vpn (MeekInjection/MeekOmittedWhenNil) confirming the
selector tag list includes the meek tag and Outbounds is augmented
correctly.

Single source of truth for the meek-server URL the production wiring
will dial through Akamai. End-to-end verified 2026-05-23: domain-fronted
POST returns the echoed payload in ~470ms.