MAGIK-935: Website URL Provider — Meta/OG/JSON-LD Extraction
Epic: EPIC-025 — #113
Priority: P0
Estimate: 5 SP
Depends on: MAGIK-934
Description
Create WebsiteProvider that crawls a given website URL and extracts profile-relevant data from HTML meta tags, Open Graph tags, JSON-LD structured data, schema.org markup, and visible page content.
Implementation
Class: Libraries/Enrichment/WebsiteProvider.php
URL matching: Any valid HTTP/HTTPS URL not matched by social-specific providers.
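Sketch (non-normative): one way the fallback match could look. It assumes EnrichmentService tries the social-specific providers first, so this check only has to confirm a valid HTTP/HTTPS URL. The standalone function form is an illustration, not the final method signature.

```php
<?php
// Illustrative sketch; in the real provider this would be a class method.

/**
 * Returns true for any syntactically valid http:// or https:// URL.
 * Assumption: social-specific providers are consulted before this one.
 */
function canHandleUrl(string $url): bool
{
    // Reject anything that is not a syntactically valid URL.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    $host   = (string) parse_url($url, PHP_URL_HOST);

    // Only web URLs with a host qualify (no ftp:, mailto:, etc.).
    return in_array($scheme, ['http', 'https'], true) && $host !== '';
}
```

Keeping the check this loose matches the "fallback provider" role; excluding social hosts here as well would duplicate the matching logic of the other providers.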
Extraction layers (in priority order):
- JSON-LD / schema.org: <script type="application/ld+json"> for Organization, LocalBusiness, Person
- Open Graph tags: og:title, og:description, og:image, og:url, og:site_name
- HTML meta tags: <meta name="description">, <meta name="author">, <link rel="icon">
- Visible content heuristics: regex for emails (mailto:), phone patterns, address blocks
- Social link discovery: <a href> matching known social platform URL patterns
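Sketch (non-normative): the first three layers above, using PHP's bundled DOM and JSON extensions. All function names here are hypothetical and are not the planned HtmlExtractor API. JSON-LD @graph containers and non-ASCII encodings are deliberately not handled.

```php
<?php
// Illustrative sketch only; names are assumptions, not the HtmlExtractor API.

/** Parse HTML leniently, since real-world pages are rarely valid markup. */
function loadDom(string $html): DOMXPath
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parse warnings quietly
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return new DOMXPath($dom);
}

/** Layer 1: JSON-LD objects typed Organization, LocalBusiness, or Person. */
function extractJsonLd(DOMXPath $xpath): array
{
    $wanted = ['Organization', 'LocalBusiness', 'Person'];
    $found = [];
    foreach ($xpath->query('//script[@type="application/ld+json"]') as $node) {
        $data = json_decode($node->textContent, true);
        if (!is_array($data)) {
            continue; // malformed JSON-LD is skipped, not fatal
        }
        // A block may hold a single object or a list of objects.
        $items = isset($data['@type']) ? [$data] : $data;
        foreach ($items as $item) {
            // @type may be a string or a list of strings.
            $types = is_array($item) ? (array) ($item['@type'] ?? []) : [];
            if (array_intersect($types, $wanted) !== []) {
                $found[] = $item;
            }
        }
    }
    return $found;
}

/** Layer 2: every og:* property mapped to its content attribute. */
function extractOpenGraph(DOMXPath $xpath): array
{
    $tags = [];
    foreach ($xpath->query('//meta[starts-with(@property, "og:")]') as $meta) {
        $tags[$meta->getAttribute('property')] = $meta->getAttribute('content');
    }
    return $tags;
}

/** Layer 3: plain meta tags and the favicon link (empty string if absent). */
function extractMeta(DOMXPath $xpath): array
{
    return [
        'description' => $xpath->evaluate('string(//meta[@name="description"]/@content)'),
        'author'      => $xpath->evaluate('string(//meta[@name="author"]/@content)'),
        'icon'        => $xpath->evaluate('string(//link[@rel="icon"]/@href)'),
    ];
}
```

Each layer returns its raw findings without merging them, so the priority order can be applied in a separate step.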
Extracted fields:
| Field | Source | Confidence |
| --- | --- | --- |
| Company/Site name | JSON-LD > OG > <title> | 0.9 / 0.8 / 0.6 |
| Description | JSON-LD > OG > meta description | 0.9 / 0.8 / 0.7 |
| Logo/Favicon | JSON-LD logo > OG image > <link rel="icon"> | 0.9 / 0.7 / 0.5 |
| Emails | JSON-LD > mailto: links > regex | 0.9 / 0.8 / 0.5 |
| Phones | JSON-LD > tel: links > regex | 0.9 / 0.8 / 0.4 |
| Address | JSON-LD PostalAddress > address block regex | 0.9 / 0.3 |
| Social links | <a> href matching fb/ig/yt/tw/li patterns | 0.8 |
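Sketch (non-normative): how the "A > B > C" source priority and the per-source confidence could be resolved for a single field. The candidate array shape and the name resolveField are assumptions made for illustration.

```php
<?php
// Hypothetical helper; the candidate shape is an assumption, not a spec.

/**
 * Return the first non-empty candidate, given candidates in priority order.
 *
 * @param array<int, array{source: string, value: ?string, confidence: float}> $candidates
 * @return array{source: string, value: string, confidence: float}|null
 */
function resolveField(array $candidates): ?array
{
    foreach ($candidates as $candidate) {
        $value = trim((string) ($candidate['value'] ?? ''));
        if ($value !== '') {
            return [
                'source'     => $candidate['source'],
                'value'      => $value,
                'confidence' => $candidate['confidence'],
            ];
        }
    }
    return null; // no layer produced a value
}

// Company/Site name: JSON-LD (0.9) > OG (0.8) > <title> (0.6).
$name = resolveField([
    ['source' => 'json-ld', 'value' => null,          'confidence' => 0.9],
    ['source' => 'og',      'value' => 'Acme OG',     'confidence' => 0.8],
    ['source' => 'title',   'value' => 'Acme | Home', 'confidence' => 0.6],
]);
// The OG candidate is chosen here because the JSON-LD layer found nothing.
```

The confidence travels with the chosen value, so downstream code can tell a JSON-LD hit from a regex guess.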
HTTP client: CodeIgniter's CURLRequest with 10s timeout, User-Agent: MagikTap-Enrichment/1.0, robots.txt check.
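Sketch (non-normative): a simplified robots.txt check as a pure function, plus the request options from the line above. The function name and the option array are assumptions. Wildcards (* and $) in rule paths are not handled, and fetching robots.txt itself is left out.

```php
<?php
// Illustrative sketch. Assumed usage with CodeIgniter's CURLRequest:
//   \Config\Services::curlrequest()->request('GET', $url, FETCH_OPTIONS);
const FETCH_OPTIONS = [
    'timeout' => 10,
    'headers' => ['User-Agent' => 'MagikTap-Enrichment/1.0'],
];

/**
 * Decide whether $path may be fetched, given the robots.txt body.
 * $agentToken is the product token without version, e.g. "MagikTap-Enrichment".
 */
function isAllowedByRobots(string $robotsTxt, string $path, string $agentToken): bool
{
    $rules    = ['*' => [], 'own' => []];
    $current  = [];    // buckets fed by the group being read
    $inAgents = false; // true while reading consecutive User-agent lines

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // drop comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$key, $value] = array_map('trim', explode(':', $line, 2));
        $key = strtolower($key);

        if ($key === 'user-agent') {
            if (!$inAgents) {
                $current  = []; // a new group starts
                $inAgents = true;
            }
            if ($value === '*') {
                $current[] = '*';
            } elseif (strcasecmp($value, $agentToken) === 0) {
                $current[] = 'own';
            }
        } elseif ($key === 'allow' || $key === 'disallow') {
            $inAgents = false;
            foreach ($current as $bucket) {
                $rules[$bucket][] = [$key === 'allow', $value];
            }
        }
    }

    // A group naming our agent takes precedence over the wildcard group.
    $active     = $rules['own'] !== [] ? $rules['own'] : $rules['*'];
    $allowed    = true;
    $bestLength = -1;
    foreach ($active as [$isAllow, $prefix]) {
        // An empty rule path matches nothing and is skipped.
        if ($prefix === '' || strpos($path, $prefix) !== 0) {
            continue;
        }
        $length = strlen($prefix);
        // Longest matching prefix wins; Allow wins ties.
        if ($length > $bestLength || ($length === $bestLength && $isAllow)) {
            $allowed    = $isAllow;
            $bestLength = $length;
        }
    }
    return $allowed;
}
```

Keeping the check free of I/O makes it easy to unit test. The provider would fetch /robots.txt with the same client options and treat a missing file as "allowed".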
Files
| File | Action |
| --- | --- |
| Libraries/Enrichment/WebsiteProvider.php | Create |
| Libraries/Enrichment/HtmlExtractor.php | Create (shared HTML parsing utility) |
| Libraries/EnrichmentService.php | Modify (register provider) |
| Config/Enrichment.php | Modify (add website config) |
Acceptance Criteria
- canHandleUrl() matches any valid HTTP/HTTPS URL (fallback provider)