Welcome to the comprehensive documentation for CrawlX 2.0 - the modern, powerful web crawling and scraping library for Node.js.
- Installation & Quick Start - Get up and running quickly
- Basic Examples - Common use cases and patterns
- Configuration Guide - Complete configuration reference
- Architecture Overview - Understanding CrawlX's design
- Parsing System - CSS selectors and data extraction
- Plugin System - Extensibility and customization
- Task Scheduling - Concurrency and resource management
- Advanced Examples - Real-world scraping scenarios
- Plugin Development - Creating custom plugins
- Performance Tuning - Optimization strategies
- Error Handling - Robust error management
- Testing - Testing your crawlers
- Core API - Complete API documentation
- Types - TypeScript type definitions
- Factory Functions - Convenience functions
- Configuration Schema - Configuration options
- Custom HTTP Clients - Implementing custom clients
- Data Pipelines - Processing crawled data
- Monitoring & Observability - Production monitoring
- Deployment - Production deployment strategies
- Migration from v1.x - Upgrading guide
- Breaking Changes - What's changed
- Compatibility - Browser and Node.js support
```bash
npm install crawlx
```

```javascript
import { quickCrawl } from 'crawlx';

// Simple data extraction
const result = await quickCrawl('https://example.com', {
  title: 'title',
  description: 'meta[name="description"]@content'
});

console.log(result.parsed);
```

```javascript
import { createScraper, createSpider } from 'crawlx';

// Create a scraper for data extraction
const scraper = createScraper();
const data = await scraper.crawl('https://example.com', {
  parse: { title: 'title', links: '[a@href]' }
});

// Create a spider for link following
const spider = createSpider();
const results = await spider.crawlMany(['https://example.com'], {
  parse: { title: 'title' },
  follow: '[a@href]'
});
```

CrawlX 2.0 is built with a modular architecture:
```
┌──────────────────┐
│   CrawlX Core    │  ←  Main orchestrator
├──────────────────┤
│  Plugin Manager  │  ←  Extensibility layer
├──────────────────┤
│  Task Scheduler  │  ←  Concurrency & queuing
├──────────────────┤
│   HTTP Client    │  ←  Network layer
├──────────────────┤
│  Parser Engine   │  ←  Data extraction
├──────────────────┤
│  Config Manager  │  ←  Configuration
└──────────────────┘
```
- CrawlX Core: Main crawler class that orchestrates all operations
- Plugin Manager: Handles plugin lifecycle and hook execution
- Task Scheduler: Manages task queuing, prioritization, and resource limits
- HTTP Client: Handles HTTP requests with multiple modes (lightweight/high-performance)
- Parser Engine: CSS selector-based parsing with filters and transformations
- Config Manager: Schema-based configuration with validation and environment support
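In code, these layers surface through a single `CrawlX` instance: configuration at construction time, plugins via `addPlugin()`, parsing via `parse` rules, and scheduler/HTTP statistics via `getStats()`. A minimal sketch using only the calls shown elsewhere on this page (the `CrawlX` import from the package root is assumed):

```javascript
import { CrawlX } from 'crawlx';

// Config Manager: options are validated when the crawler is constructed
const crawler = new CrawlX({ mode: 'high-performance', concurrency: 5 });

// CrawlX Core + Parser Engine: run a task with a parse rule
const result = await crawler.crawl('https://example.com', {
  parse: { title: 'title' }
});
console.log(result.parsed);

// Task Scheduler / HTTP Client: inspect runtime statistics
const stats = crawler.getStats();
console.log(stats.scheduler, stats.httpClient);
```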
CrawlX features a powerful plugin system with built-in plugins:
- ParsePlugin: Data extraction and parsing
- FollowPlugin: Link following and discovery
- RetryPlugin: Automatic retry with exponential backoff
- DelayPlugin: Request delays and politeness
- DuplicateFilterPlugin: URL deduplication
- RateLimitPlugin: Advanced rate limiting
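Built-in plugins are enabled and tuned through the `plugins` block of the crawler configuration. In the sketch below, only the `delay` options are taken from the configuration example further down this page; the `retry` and `duplicateFilter` keys and their options are assumed names, shown for illustration only:

```javascript
import { CrawlX } from 'crawlx';

const crawler = new CrawlX({
  plugins: {
    delay: { enabled: true, defaultDelay: 1000 }, // politeness delay in ms (from the config example below)
    retry: { enabled: true, maxRetries: 3 },      // assumed option names
    duplicateFilter: { enabled: true }            // assumed option names
  }
});
```

A custom plugin hooks into the same lifecycle; for example: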
```javascript
class CustomPlugin {
  name = 'custom';
  version = '1.0.0';
  priority = 100;

  async onTaskComplete(result) {
    result.customData = { processed: true };
    return result;
  }
}

const crawler = new CrawlX();
crawler.addPlugin(new CustomPlugin());
```

CrawlX provides powerful CSS selector-based parsing with filters:
```javascript
const parseRule = {
  // Simple selectors
  title: 'title',
  links: '[a@href]',

  // Nested structures
  products: {
    _scope: '.product',
    name: '.name',
    price: '.price | trim | number',
    details: {
      _scope: '.details',
      description: '.desc',
      specs: ['.spec']
    }
  },

  // Custom functions
  timestamp: () => new Date().toISOString(),
  productCount: ($) => $('.product').length
};
```

A rule like this is passed through the `parse` option of `crawl()`, as in the quick start examples above.

Networking and performance features:

- Dual Modes: Lightweight and high-performance modes
- Smart Concurrency: Configurable concurrent request handling
- Connection Pooling: Efficient HTTP connection management
- Caching: Response caching with TTL support
- Rate Limiting: Token bucket-based rate limiting (see the sketch after this list)
- Memory Management: Resource monitoring and limits
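The rate limiter is token bucket based. The following standalone sketch illustrates the token bucket idea only; it is not CrawlX's internal implementation, and every name in it is illustrative:

```javascript
// Token bucket: tokens refill at a fixed rate, each request spends one token,
// and short bursts of up to `capacity` requests are allowed.
class TokenBucket {
  constructor(ratePerSecond, capacity) {
    this.rate = ratePerSecond;  // tokens added per second
    this.capacity = capacity;   // maximum burst size
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  tryRemoveToken() {
    const now = Date.now();
    // Refill proportionally to the elapsed time, capped at capacity.
    this.tokens = Math.min(
      this.capacity,
      this.tokens + ((now - this.lastRefill) / 1000) * this.rate
    );
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;  // the request may proceed
    }
    return false;   // the caller should wait and try again
  }
}

const bucket = new TokenBucket(5, 10); // 5 requests/second, bursts of up to 10
if (bucket.tryRemoveToken()) {
  // send the request
}
```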
Comprehensive error handling with custom error types:
```javascript
import { CrawlXError, NetworkError, TimeoutError } from 'crawlx';

try {
  const result = await crawler.crawl('https://example.com');
} catch (error) {
  if (error instanceof NetworkError) {
    console.log('Network error:', error.statusCode);
  } else if (error instanceof TimeoutError) {
    console.log('Timeout after:', error.timeout);
  }
}
```

Built-in statistics and monitoring:
```javascript
const stats = crawler.getStats();

console.log({
  isRunning: stats.isRunning,
  results: stats.results,
  scheduler: stats.scheduler,
  httpClient: stats.httpClient,
  plugins: stats.plugins
});
```

Flexible configuration with multiple sources:
```javascript
// Code configuration
const crawler = new CrawlX({
  mode: 'high-performance',
  concurrency: 10,
  plugins: {
    delay: { enabled: true, defaultDelay: 1000 }
  }
});
```

```bash
# Environment variables
CRAWLX_MODE=high-performance
CRAWLX_CONCURRENCY=10
CRAWLX_PLUGINS_DELAY_ENABLED=true
```

```javascript
// Configuration presets
import { ConfigPresets } from 'crawlx';

const prodCrawler = ConfigPresets.production();
```

We welcome contributions! Please see our Contributing Guide for details.
CrawlX is released under the MIT License.
- GitHub Issues - Bug reports and feature requests
- GitHub Discussions - Community support
- Documentation - Comprehensive guides and API reference
Ready to start crawling? Check out the Getting Started Guide to begin your journey with CrawlX!