
[comp] Production Deploy #2574

Merged
Marfuen merged 2 commits into release from main on Apr 16, 2026

Conversation


github-actions bot commented on Apr 16, 2026

This is an automated pull request to release the candidate branch into production, which will trigger a deployment.
It was created by the [Production PR] action.


Summary by cubic

Adds a trust-portal deep scrape to reliably extract certifications (even from SPA trust centers) and merges them with the Firecrawl Agent results. Also extends timeouts and fixes response parsing to prevent empty or partial assessments.

  • New Features

    • Added a deep-scrape pass that discovers anchors or clicks SPA tabs, aggregates the markdown, extracts certifications, and merges them with the core results (deduped by slug with status priority).
    • Reworked the core agent: the prompt now prioritizes returning trust_center_url, the JSON schema was extracted into its own module, and the seed URLs were expanded for better portal discovery.
    • Increased the task max duration to 30 minutes; added diagnostic logs and persisted complianceBadgesJson and certificationsInAssessmentJson.
  • Bug Fixes

    • Handle non-completed Firecrawl statuses and bump agent timeouts to 25 minutes for core/news; retry once on fetch failures to avoid silent empties.
    • Parse agent payload by scoring candidates so populated .data wins over empty wrappers.
    • Gate and sanitize URLs: pick on-domain sources, skip known third‑party portals, and escape CSS selectors during scraping.

Written for commit 08a3786. Summary will update on new commits.

github-actions Bot and others added 2 commits April 16, 2026 19:53
* fix(vendor): harden firecrawl trust center crawling

* refactor(vendor): export TRUSTED_PORTAL_DOMAINS and add host check helper

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
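
A host check like the one this commit describes might be sketched as below. The helper and constant names come from the commit message, but the domain values and exact matching rule are assumptions, not the real module:

```typescript
// Known third-party trust-portal hosts. The entries here are placeholder
// examples; the real TRUSTED_PORTAL_DOMAINS list is not shown in this PR.
const TRUSTED_PORTAL_DOMAINS = [
  "trust.example-portal.com",
  "security.example-grc.io",
] as const;

// Returns true when the URL's hostname is a trusted portal domain or one of
// its subdomains. Malformed URLs are never trusted.
function isTrustedPortalHost(rawUrl: string): boolean {
  let host: string;
  try {
    host = new URL(rawUrl).hostname.toLowerCase();
  } catch {
    return false;
  }
  return TRUSTED_PORTAL_DOMAINS.some(
    (domain) => host === domain || host.endsWith(`.${domain}`)
  );
}
```

Matching on `host === domain || host.endsWith("." + domain)` avoids the classic suffix-spoof bug where `trust.example-portal.com.evil.com` would pass a naive `includes` check.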

* feat(vendor): add trust portal section-url discovery helper

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
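
The section-URL discovery could look roughly like this: keep on-domain anchors whose path hints at a trust-portal section. The function name matches the commit; the keyword list and dedupe rule are assumptions:

```typescript
// Path fragments that suggest a trust-portal section (assumed list).
const SECTION_HINTS = /certif|complian|security|privacy|subprocessor|controls?/i;

// Given the portal landing URL and the anchors found on it, return the
// deduplicated on-domain section URLs worth scraping.
function discoverSectionUrls(baseUrl: string, links: string[]): string[] {
  const base = new URL(baseUrl);
  const seen = new Set<string>();
  const sections: string[] = [];
  for (const link of links) {
    let url: URL;
    try {
      url = new URL(link, base); // resolve relative anchors against the portal
    } catch {
      continue; // skip unparsable hrefs
    }
    if (url.hostname !== base.hostname) continue; // on-domain only
    if (!SECTION_HINTS.test(url.pathname + url.hash)) continue;
    if (seen.has(url.href)) continue;
    seen.add(url.href);
    sections.push(url.href);
  }
  return sections;
}
```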

* feat(vendor): add certification merge helper with status priority

Pure mergeCertifications function dedupes by canonical slug and resolves
status via verified > expired > unknown > not_certified priority, preferring
core URL/dates on ties.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
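
The merge rule described above can be sketched as a pure function. The `Certification` shape is an assumption; the priority order and tie-breaking follow the commit message:

```typescript
type CertStatus = "verified" | "expired" | "unknown" | "not_certified";

interface Certification {
  slug: string; // canonical slug used for deduplication
  status: CertStatus;
  url?: string;
}

// verified > expired > unknown > not_certified
const STATUS_PRIORITY: Record<CertStatus, number> = {
  verified: 3,
  expired: 2,
  unknown: 1,
  not_certified: 0,
};

// Dedupe by slug; a deep-scrape record only replaces the core record when its
// status strictly outranks it, so core URL/dates win on ties.
function mergeCertifications(
  core: Certification[],
  deepScrape: Certification[]
): Certification[] {
  const merged = new Map<string, Certification>();
  for (const cert of core) merged.set(cert.slug, cert);
  for (const cert of deepScrape) {
    const existing = merged.get(cert.slug);
    if (!existing || STATUS_PRIORITY[cert.status] > STATUS_PRIORITY[existing.status]) {
      merged.set(cert.slug, cert);
    }
  }
  return [...merged.values()];
}
```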

* feat(vendor): scaffold trust portal deep-scrape orchestrator with gate

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(vendor): implement trust portal deep-scrape orchestrator

Clicks through SPA sidebar sections, concatenates markdown from each,
and extracts certifications via Claude Sonnet 4.6.

* fix(vendor): escape CSS selector values and cover concurrency bound

Add cssEscapeAttr helper to sanitize `\` and `"` inside CSS double-quoted
attribute values in buildSectionScrapeOptions, preventing silent selector
no-ops for anchor slugs containing CSS-reserved characters. Add two new
tests: one verifying the escaping (using `\` which survives URL normalization)
and one confirming mapWithConcurrency covers all items when section count (8)
exceeds SECTION_CONCURRENCY (5).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
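
A minimal sketch of the escaping helper, assuming it targets double-quoted CSS attribute values as the commit states (`anchorSelector` is an illustrative caller, not a name from the PR):

```typescript
// Escape the two characters that can terminate or corrupt a double-quoted CSS
// attribute value. Backslash is replaced first so the added escapes are not
// themselves re-escaped.
function cssEscapeAttr(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
}

// Hypothetical caller: build a selector for an anchor slug without letting
// CSS-reserved characters silently break the selector.
function anchorSelector(slug: string): string {
  return `a[href="#${cssEscapeAttr(slug)}"]`;
}
```

Without this, a slug containing `"` would close the attribute string early and the selector would match nothing, which is exactly the silent no-op the commit describes.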

* feat(vendor): run trust portal deep-scrape after core agent

Resolves a source URL (trust center -> security page -> verified cert url),
runs deepScrapeTrustPortal, and merges certifications before returning.
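
The fallback chain can be sketched as below. The function name matches the commit; the input shapes are assumptions, and the real helper additionally gates on the vendor domain and skips known third-party portals (per the summary), which this sketch omits:

```typescript
interface AgentLinks {
  trust_center_url?: string | null;
  security_page_url?: string | null;
}

interface Cert {
  status: string;
  url?: string | null;
}

// Walk the fallback chain: trust center -> security page -> first verified
// certification URL. Returns null when no usable source URL exists.
function pickDeepScrapeSourceUrl(
  links: AgentLinks,
  certifications: Cert[]
): string | null {
  if (links.trust_center_url) return links.trust_center_url;
  if (links.security_page_url) return links.security_page_url;
  const verified = certifications.find((c) => c.status === "verified" && c.url);
  return verified?.url ?? null;
}
```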

* refactor(vendor): extract pickDeepScrapeSourceUrl and tighten extraction prompt

Move pickDeepScrapeSourceUrl into its own module with unit tests so
firecrawl-agent-core.ts drops below the 300-line limit. Also hoist the
Firecrawl Agent JSON schema into firecrawl-agent-schema-json.ts for the
same reason. Tighten the Sonnet 4.6 extraction prompt to explicitly
require evidence_snippet so Claude doesn't silently drop rows.

* feat(vendor): log Agent snapshot, deep-scrape decision, and persisted certs

Adds three diagnostic logs so a trigger.dev run tells the full story:

- "Firecrawl Agent returned — pre-deep-scrape snapshot" dumps the raw
  Agent links, normalized links, and cert types/statuses before the
  deep-scrape decision. Exposes what the LLM actually found.

- Deep-scrape branch logs either "source URL resolved" + merged types,
  "returned no certifications", or "skipped: no usable URL on vendor
  domain" with available links + verified certs — no more silent
  gate decisions.

- "Risk level and badges extracted" now includes the full compliance
  badge payload and the certifications array being persisted to the
  vendor record, so DB-write state is inspectable from logs.

* fix(vendor): json-stringify complex diagnostic log fields

Trigger.dev's OpenTelemetry attribute pipeline strips nested objects
and arrays — keeping only top-level scalars — so rich log payloads
like rawAgentLinks, normalizedLinks, and complianceBadges were being
silently discarded. Serialize them to JSON strings so they survive
the OTel export and surface in the dashboard / MCP span details.
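
The workaround amounts to flattening log fields before they hit the attribute pipeline: scalars pass through, everything nested becomes a JSON string. A sketch (the helper name is an assumption; the real code logs via trigger.dev's logger):

```typescript
// OTel attribute pipelines keep only top-level scalars, so nested objects and
// arrays must be serialized to JSON strings to survive the export.
function flattenForOtel(
  fields: Record<string, unknown>
): Record<string, string | number | boolean> {
  const flat: Record<string, string | number | boolean> = {};
  for (const [key, value] of Object.entries(fields)) {
    if (
      typeof value === "string" ||
      typeof value === "number" ||
      typeof value === "boolean"
    ) {
      flat[key] = value; // scalars pass through unchanged
    } else {
      flat[key] = JSON.stringify(value); // would otherwise be stripped
    }
  }
  return flat;
}
```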

* feat(vendor): rewrite Firecrawl Agent prompt — URL-discovery first

Prior prompt treated trust_center_url as just another field, so when the
Agent failed to extract certifications from a JavaScript SPA (e.g.
ui.com/trust-center) it abandoned the whole output — including the URL
the downstream deep-scrape needs.

New prompt reframes the mission:
- Primary goal: return trust_center_url even when page content is empty
  or SPA-only. Deep-scrape handles rendering; Agent just has to find.
- Explicit numbered URL paths to try when nav discovery fails, including
  third-party portals keyed off the vendor slug.
- Explicit instruction to return URLs of SPA-only pages rather than
  discarding them.
- Stricter output contract marking trust_center_url as REQUIRED when
  any trust/security/compliance surface exists on the vendor domain.
- Bumped maxCredits 2500 → 4000 to give the Agent headroom on sites
  that require multi-hop discovery.

Prompt extracted into firecrawl-agent-prompt.ts to keep core orchestrator
under the 300-line limit.

* chore(vendor): log raw firecrawl agent response for ui.com diagnosis

Adds temporary diagnostic logs capturing:
- agentResponse.success / status / error / keys (before schema parse)
- first 4KB of the raw agentResponse JSON
- first 4KB of parsed.data JSON, plus security_assessment and risk_level

The agent is returning links: null for ubiquiti even after the URL-first
prompt rewrite — need to see what it IS returning to understand whether
it's a fetch block, a model compliance issue, or a parse path we're
missing. Pushes the file to 315 lines; will roll back once diagnosed.

* fix(vendor): handle firecrawl agent processing status + extend timeouts

Discovered via new diagnostic log: the Firecrawl SDK's agent call was
returning status="processing" on ui.com because its internal poll timed
out (360s) before the agent job completed on Firecrawl's side. Our code
only guarded against status="failed", so it silently parsed the empty
response as success — leaving vendor records with no certifications
even when the agent could have found them given more time.

Changes:
- Guard on status !== "completed" instead of just "failed"; log clearly
  when SDK returns while job is still processing so timeouts are
  visible instead of silent.
- Bump agent SDK timeout 360s -> 1500s (25 min) so slow SPA trust
  centers like Ubiquiti have room to finish.
- Bump task maxDuration 10 min -> 30 min to accommodate the longer
  agent call plus deep-scrape + DB writes.
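
The tightened guard can be sketched as follows; the response shape and names are assumptions based on the commit message, not the Firecrawl SDK's actual types:

```typescript
interface AgentResponse {
  status: "completed" | "processing" | "failed";
  data?: unknown;
  error?: string;
}

const AGENT_TIMEOUT_SECONDS = 1500; // was 360; 25 min for slow SPA trust centers

// Guard on status !== "completed" rather than only "failed": a "processing"
// status means the SDK's internal poll gave up before the job finished, and
// parsing its empty body as success is exactly the silent-empty bug.
function assertAgentCompleted(response: AgentResponse): unknown {
  if (response.status !== "completed") {
    throw new Error(
      `Firecrawl agent not completed: status=${response.status} ` +
        `error=${response.error ?? "none"}`
    );
  }
  return response.data;
}
```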

* fix(vendor): score agent payload candidates by populated fields

The firecrawl agent response has a nested shape:
  { success, status, data: { links, certifications, ... }, ... }

extractAgentPayloadCandidates returns [wrapper, wrapper.data] in that
order, and every field in vendorRiskAssessmentAgentSchema is optional.
The wrapper therefore parsed successfully as an empty object and won
the first-match .find() lookup — even though it contained no real
fields. The actual .data payload (with trust_center_url, security
page, privacy policy, etc.) was silently discarded.

Pick the candidate with the most populated schema fields instead of
the first success. This has been a latent bug on main — the ubiquiti
run on v20260415.12 showed the same "found 0 links, 0 certifications"
symptom.
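
The scoring fix can be sketched like this: count populated schema fields per candidate and pick the highest, so a real `.data` payload beats an all-optional wrapper that parses as an empty object. The field list is an assumption; `countPopulatedAgentFields` is the name from the later refactor commit:

```typescript
// Assumed subset of the agent schema's fields; the real schema has more.
const AGENT_FIELDS = [
  "trust_center_url",
  "security_page_url",
  "privacy_policy_url",
  "certifications",
] as const;

// Score a candidate by how many schema fields actually carry data.
function countPopulatedAgentFields(candidate: Record<string, unknown>): number {
  return AGENT_FIELDS.filter((field) => {
    const value = candidate[field];
    if (value == null) return false;
    if (Array.isArray(value)) return value.length > 0;
    return value !== "";
  }).length;
}

// Pick the candidate with the most populated fields instead of the first one
// that parses, so the wrapper can no longer shadow wrapper.data.
function pickBestCandidate(
  candidates: Record<string, unknown>[]
): Record<string, unknown> | undefined {
  return [...candidates].sort(
    (a, b) => countPopulatedAgentFields(b) - countPopulatedAgentFields(a)
  )[0];
}
```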

* fix(vendor): remove invalid maxCredits from scrape calls

Firecrawl's v2 /scrape endpoint rejects maxCredits — that option
belongs to the Agent API, not scrape. We were passing it on both
the initial scrape and the per-section scrapes, and Firecrawl was
returning "Unrecognized key in body", causing the deep-scrape pass
to fail on its very first call.

Replace with `timeout` (2 min per scrape, within Firecrawl's 5-min
cap) which is the scrape v2 equivalent of "budget per call."

* chore(vendor): log raw initial scrape output for section discovery diag

Ubiquiti run finished with sectionCount=0 even though the initial
scrape returned 9891 chars of markdown. Need to see what
firecrawlClient.scrape actually returned in `links` to understand
whether the sidebar items are missing from the response or whether
discoverSectionUrls is wrongly filtering them out.

Logs the first 50 links and the first 2KB of markdown from the initial
scrape. Temporary diagnostic, will trim once the sidebar discovery
strategy is fixed.

* feat(vendor): llm-driven tab discovery for spa trust portals

Ubiquiti's trust center sidebar items are <button>/<div onClick>
elements with no href, so Firecrawl's `links` format returns 0 anchor
URLs for them. URL-based section discovery then had nothing to work
with and the deep-scrape only ever saw the landing tab.

Add a tab-discovery step: when URL-based discovery yields zero
sections, pass the initial markdown to Claude Sonnet 4.6 to identify
sidebar labels, then scrape each one with an executeJavascript
click-by-text action. The click script finds the matching element by
exact textContent, scrolls it into view, and clicks it. Works for any
SPA that has tab labels visible in the rendered markdown — not just
Ubiquiti.

Flow:
  1. Initial scrape -> markdown + links
  2. URL-based discovery (existing, unchanged)
  3. If urlSections.length === 0 and markdown non-empty,
     call identifySidebarTabs to get labels from the LLM
  4. Merge url-based + tab-label sections, dedupe by label, cap at 25
  5. Per-section scrape with click-by-text OR click-by-href
  6. Combine markdown, extract certs, merge

Files:
  new   trust-portal-deep-scrape-tabs.ts  (92 lines)
  edit  trust-portal-deep-scrape.ts       (+70 lines)
  edit  trust-portal-deep-scrape-sections.ts  (+tabLabel field)
  edit  trust-portal-deep-scrape.spec.ts  (1 new test, 3 updated)
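
The click-by-text action might be built along these lines; `buildClickByTextScript` is the name from the later refactor commit, but the exact injected script is an assumption. The label is JSON-stringified so quotes and backslashes cannot break out of the generated code:

```typescript
// Build a script for Firecrawl's executeJavascript action: find the element
// whose exact trimmed textContent matches the sidebar label, scroll it into
// view, and click it. Works for href-less <button>/<div onClick> tabs.
function buildClickByTextScript(label: string): string {
  const safeLabel = JSON.stringify(label.trim()); // safely embed the label
  return `
    (() => {
      const target = ${safeLabel};
      const nodes = document.querySelectorAll("button, [role='tab'], a, div, span");
      for (const node of nodes) {
        if ((node.textContent ?? "").trim() === target) {
          node.scrollIntoView();
          node.click();
          return true;
        }
      }
      return false;
    })();
  `;
}
```

Matching on exact `textContent` rather than a selector is what makes this portable across SPAs: it only requires that the tab label the LLM saw in the rendered markdown also appears verbatim in the DOM.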

* fix(vendor): apply same processing-status + timeout fixes to news agent

firecrawlResearchNews had the exact two bugs we already fixed for
firecrawlResearchCore:

1. Status guard was too loose (only `=== 'failed'`), so when the SDK
   returned `status: 'processing'` (Firecrawl still running the job
   after our SDK poll timed out) we silently proceeded to read
   agentResponse.data.news, got undefined, and logged "no news items."

2. Timeout was 360s while matching agent jobs for slow vendor sites
   routinely take 6+ minutes. Ubiquiti run hit 6m 1s and returned empty,
   matching the timeout boundary almost exactly.

Bump timeout 360s -> 1500s (matches core), guard on `!== 'completed'`,
and add the same diagnostic logs we added to core so future runs surface
the raw agent response + data shape when news comes back empty.

* refactor(vendor): extract payload + scrape-option helpers, trim verbose logs

Post-debugging cleanup. No behavior change.

Files split so both orchestrators drop back under the 300-line rule:
  - firecrawl-agent-payload.ts (58) — asRecord, extractAgentPayloadCandidates,
    countPopulatedAgentFields. Moved out of firecrawl-agent-core.ts so the
    payload-candidate logic can be shared and tested separately.
  - trust-portal-deep-scrape-scrape-options.ts (107) — cssEscapeAttr,
    buildClickByTextScript, buildInitialScrapeOptions, buildSectionScrapeOptions.
    Moved out of trust-portal-deep-scrape.ts so the scrape-option + click-by-text
    JS builders are isolated from the orchestration code.

Log trimming — drop the 4KB agent-response and 2KB markdown-head dumps from
happy-path logs. They were added for live diagnosis and landed big blobs in
every prod run. Keep scalar summary fields. Full raw-response JSON now only
logged on the exceptional "not completed" warning path where it is actually
useful, not on every successful run.

File line counts:
  firecrawl-agent-core.ts         315 -> 296
  trust-portal-deep-scrape.ts     383 -> 293
  firecrawl-agent-news.ts         172 -> 158

67/67 tests still pass.

---------

Co-authored-by: Mariano Fuentes <marfuen98@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

vercel Bot commented Apr 16, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project                Deployment  Actions           Updated (UTC)
comp-framework-editor  Ready       Preview, Comment  Apr 16, 2026 8:10pm

2 Skipped Deployments

Project           Deployment  Updated (UTC)
app (staging)     Skipped     Apr 16, 2026 8:10pm
portal (staging)  Skipped     Apr 16, 2026 8:10pm


cubic-dev-ai bot left a comment


No issues found across 19 files

Requires human review: This PR contains significant business logic changes, including a new deep-scrape feature, reworked core agent prompts, and increased task timeouts, requiring human review.

Marfuen merged commit a840682 into release on Apr 16, 2026
14 checks passed
@claudfuen

🎉 This PR is included in version 3.23.1 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
