CrawlX Documentation

Welcome to the documentation for CrawlX 2.0, a modern and powerful web crawling and scraping library for Node.js.

📚 Table of Contents

  • Getting Started
  • Core Concepts
  • Guides
  • API Reference
  • Advanced Topics
  • Migration & Compatibility

🚀 Quick Start

Installation

npm install crawlx

Basic Usage

import { quickCrawl } from 'crawlx';

// Simple data extraction
const result = await quickCrawl('https://example.com', {
  title: 'title',
  description: 'meta[name="description"]@content'
});

console.log(result.parsed);
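
In a parse rule, a bare selector such as title extracts an element's text, while the @attr suffix (here @content) extracts an attribute value; the extracted data is available on result.parsed.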

Factory Functions

import { createScraper, createSpider } from 'crawlx';

// Create a scraper for data extraction
const scraper = createScraper();
const data = await scraper.crawl('https://example.com', {
  parse: { title: 'title', links: '[a@href]' }
});

// Create a spider for link following
const spider = createSpider();
const results = await spider.crawlMany(['https://example.com'], {
  parse: { title: 'title' },
  follow: '[a@href]'
});

πŸ—οΈ Architecture Overview

CrawlX 2.0 is built with a modular architecture:

┌─────────────────┐
│   CrawlX Core   │ ← Main orchestrator
├─────────────────┤
│ Plugin Manager  │ ← Extensibility layer
├─────────────────┤
│ Task Scheduler  │ ← Concurrency & queuing
├─────────────────┤
│  HTTP Client    │ ← Network layer
├─────────────────┤
│ Parser Engine   │ ← Data extraction
├─────────────────┤
│ Config Manager  │ ← Configuration
└─────────────────┘

Key Components

  • CrawlX Core: Main crawler class that orchestrates all operations
  • Plugin Manager: Handles plugin lifecycle and hook execution
  • Task Scheduler: Manages task queuing, prioritization, and resource limits
  • HTTP Client: Handles HTTP requests with multiple modes (lightweight/high-performance)
  • Parser Engine: CSS selector-based parsing with filters and transformations
  • Config Manager: Schema-based configuration with validation and environment support
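
In day-to-day use these components stay behind the CrawlX class, which wires them together. A minimal sketch combining the options and stats calls shown later in this document, and assuming crawl results expose the same parsed field as quickCrawl:

import { CrawlX } from 'crawlx';

// The core orchestrates the scheduler, HTTP client, parser, and plugins
const crawler = new CrawlX({ mode: 'lightweight', concurrency: 5 });

const result = await crawler.crawl('https://example.com', {
  parse: { title: 'title' }
});
console.log(result.parsed);

// Each subsystem reports into the aggregated stats (see Monitoring below)
console.log(crawler.getStats().scheduler);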

🧩 Plugin System

CrawlX features a powerful plugin system with built-in plugins (see the configuration sketch after this list):

  • ParsePlugin: Data extraction and parsing
  • FollowPlugin: Link following and discovery
  • RetryPlugin: Automatic retry with exponential backoff
  • DelayPlugin: Request delays and politeness
  • DuplicateFilterPlugin: URL deduplication
  • RateLimitPlugin: Advanced rate limiting
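
Built-in plugins are enabled and tuned through the plugins section of the crawler options. In the sketch below, the delay keys match the Configuration section later in this document, while the retry keys are illustrative assumptions:

import { CrawlX } from 'crawlx';

const crawler = new CrawlX({
  plugins: {
    delay: { enabled: true, defaultDelay: 1000 }, // politeness delay in ms
    retry: { enabled: true, maxRetries: 3 }       // hypothetical RetryPlugin options
  }
});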

Custom Plugin Example

import { CrawlX } from 'crawlx';

// A plugin is an object with identifying metadata and lifecycle hooks
class CustomPlugin {
  name = 'custom';
  version = '1.0.0';
  priority = 100; // ordering relative to other plugins

  // Runs after each task finishes and may enrich the result
  async onTaskComplete(result) {
    result.customData = { processed: true };
    return result;
  }
}

const crawler = new CrawlX();
crawler.addPlugin(new CustomPlugin());
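
Hooks such as onTaskComplete run at fixed points in the task lifecycle, so a plugin can enrich results or collect metrics without modifying the core.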

📊 Data Extraction

Powerful CSS selector-based parsing with filters:

const parseRule = {
  // Simple selectors
  title: 'title',
  links: '[a@href]',
  
  // Nested structures
  products: {
    _scope: '.product',
    name: '.name',
    price: '.price | trim | number',
    details: {
      _scope: '.details',
      description: '.desc',
      specs: ['.spec']
    }
  },
  
  // Custom functions
  timestamp: () => new Date().toISOString(),
  productCount: ($) => $('.product').length
};
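
A rule like this is passed as the parse option of a crawl call. A short sketch, assuming a hypothetical product-listing URL and that crawl results expose the same parsed field as quickCrawl:

import { createScraper } from 'crawlx';

const scraper = createScraper();

// Apply the rule above to a page (illustrative URL)
const data = await scraper.crawl('https://example.com/products', { parse: parseRule });
console.log(data.parsed.products);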

⚡ Performance Features

  • Dual Modes: Lightweight and high-performance modes
  • Smart Concurrency: Configurable concurrent request handling
  • Connection Pooling: Efficient HTTP connection management
  • Caching: Response caching with TTL support
  • Rate Limiting: Token bucket-based rate limiting
  • Memory Management: Resource monitoring and limits
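
Most of these features are switched on through crawler options. A hedged sketch: mode and concurrency appear in the Configuration section below, while the cache keys are illustrative assumptions:

import { CrawlX } from 'crawlx';

const crawler = new CrawlX({
  mode: 'high-performance',            // dual-mode HTTP client
  concurrency: 10,                     // concurrent request limit
  cache: { enabled: true, ttl: 60000 } // hypothetical response-cache options (TTL in ms)
});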

πŸ›‘οΈ Error Handling

Comprehensive error handling with custom error types:

import { CrawlX, CrawlXError, NetworkError, TimeoutError } from 'crawlx';

const crawler = new CrawlX();

try {
  const result = await crawler.crawl('https://example.com');
} catch (error) {
  if (error instanceof NetworkError) {
    console.log('Network error:', error.statusCode);
  } else if (error instanceof TimeoutError) {
    console.log('Timeout after:', error.timeout);
  } else if (error instanceof CrawlXError) {
    console.log('Crawler error:', error.message);
  }
}

📈 Monitoring

Built-in statistics and monitoring:

const stats = crawler.getStats();
console.log({
  isRunning: stats.isRunning,
  results: stats.results,
  scheduler: stats.scheduler,
  httpClient: stats.httpClient,
  plugins: stats.plugins
});

🔧 Configuration

Flexible configuration with multiple sources:

// Code configuration
const crawler = new CrawlX({
  mode: 'high-performance',
  concurrency: 10,
  plugins: {
    delay: { enabled: true, defaultDelay: 1000 }
  }
});

# Environment variables (shell); names map onto config paths,
# e.g. CRAWLX_PLUGINS_DELAY_ENABLED → plugins.delay.enabled
CRAWLX_MODE=high-performance
CRAWLX_CONCURRENCY=10
CRAWLX_PLUGINS_DELAY_ENABLED=true

// Configuration presets
import { ConfigPresets } from 'crawlx';
const prodCrawler = ConfigPresets.production();

📖 Learning Path

Beginner

  1. Getting Started
  2. Basic Examples
  3. Configuration Guide

Intermediate

  1. Advanced Examples
  2. Plugin Development
  3. Performance Tuning

Advanced

  1. Custom HTTP Clients
  2. Data Pipelines
  3. Production Deployment

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

📄 License

CrawlX is released under the MIT License.

🆘 Support


Ready to start crawling? Check out the Getting Started Guide to begin your journey with CrawlX!