
🌐 ScrapeGraph AI SDK


Official Python SDK for the ScrapeGraph AI API - Intelligent web scraping and search powered by AI. Extract structured data from any webpage or perform AI-powered web searches with natural language prompts.

Get your API key!

Features

  • πŸ€– SmartScraper: Extract structured data from webpages using natural language prompts
  • πŸ” SearchScraper: AI-powered web search with structured results and reference URLs
  • πŸ“ Markdownify: Convert any webpage into clean, formatted markdown
  • πŸ•·οΈ SmartCrawler: Intelligently crawl and extract data from multiple pages
  • πŸ€– AgenticScraper: Perform automated browser actions with AI-powered session management
  • πŸ“„ Scrape: Convert webpages to HTML with JavaScript rendering and custom headers
  • ⏰ Scheduled Jobs: Create and manage automated scraping workflows with cron scheduling
  • πŸ’³ Credits Management: Monitor API usage and credit balance
  • πŸ’¬ Feedback System: Provide ratings and feedback to improve service quality

πŸš€ Quick Links

ScrapeGraphAI offers seamless integration with popular frameworks and tools to enhance your scraping capabilities. Whether you're building with Python, using LLM frameworks, or working with no-code platforms, our comprehensive integration options have you covered.

You can find more information at the following links.

Integrations:

πŸ“¦ Installation

pip install scrapegraph-py
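The examples below read the API key from the environment via python-dotenv. A minimal setup (the key value is a placeholder):

```shell
# Store your key in a .env file; python-dotenv (used in the examples) will load it
echo 'SGAI_API_KEY=your-api-key-here' > .env
cat .env
```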

🎯 Core Features

  • πŸ€– AI-Powered Extraction & Search: Use natural language to extract data or search the web
  • πŸ“Š Structured Output: Get clean, structured data with optional schema validation
  • πŸ”„ Multiple Formats: Extract data as JSON, Markdown, or custom schemas
  • ⚑ High Performance: Concurrent processing and automatic retries
  • πŸ”’ Enterprise Ready: Production-grade security and rate limiting
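For schema-validated output, `smartscraper` accepts a Pydantic model via the `output_schema` parameter. A minimal sketch, where the model fields are illustrative and the network call only runs when `SGAI_API_KEY` is set:

```python
import os

from pydantic import BaseModel


class PageInfo(BaseModel):
    """Illustrative target schema for the extraction."""
    title: str
    description: str


if os.getenv("SGAI_API_KEY"):
    from scrapegraph_py import Client

    client = Client(api_key=os.getenv("SGAI_API_KEY"))
    response = client.smartscraper(
        website_url="https://example.com",
        user_prompt="Extract the page title and description",
        output_schema=PageInfo,  # result is validated against this model
    )
    print(response["result"])
    client.close()
```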

πŸ› οΈ Available Endpoints

πŸ€– SmartScraper

Extract structured data from any webpage or HTML content using natural language prompts.

Example Usage:

from scrapegraph_py import Client
import os
from dotenv import load_dotenv

load_dotenv()

# Initialize the client
client = Client(api_key=os.getenv("SGAI_API_KEY"))

# Extract data from a webpage
response = client.smartscraper(
    website_url="https://example.com",
    user_prompt="Extract the main heading, description, and summary of the webpage",
)

print(f"Request ID: {response['request_id']}")
print(f"Result: {response['result']}")

client.close()

πŸ” SearchScraper

Perform AI-powered web searches with structured results and reference URLs.

Example Usage:

from scrapegraph_py import Client
import os
from dotenv import load_dotenv

load_dotenv()

# Initialize the client
client = Client(api_key=os.getenv("SGAI_API_KEY"))

# Perform AI-powered web search
response = client.searchscraper(
    user_prompt="What is the latest version of Python and what are its main features?",
    num_results=3,  # Number of websites to search (default: 3)
)

print(f"Result: {response['result']}")
print("\nReference URLs:")
for url in response["reference_urls"]:
    print(f"- {url}")

client.close()

πŸ“ Markdownify

Convert any webpage into clean, formatted markdown.

Example Usage:

from scrapegraph_py import Client
import os
from dotenv import load_dotenv

load_dotenv()

# Initialize the client
client = Client(api_key=os.getenv("SGAI_API_KEY"))

# Convert webpage to markdown
response = client.markdownify(
    website_url="https://example.com",
)

print(f"Request ID: {response['request_id']}")
print(f"Markdown: {response['result']}")

client.close()

πŸ•·οΈ SmartCrawler

Intelligently crawl and extract data from multiple pages with configurable depth and batch processing.

Example Usage:

from scrapegraph_py import Client
import os
import time
from dotenv import load_dotenv

load_dotenv()

# Initialize the client
client = Client(api_key=os.getenv("SGAI_API_KEY"))

# Start crawl job
crawl_response = client.crawl(
    url="https://example.com",
    prompt="Extract page titles and main headings",
    data_schema={
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "headings": {"type": "array", "items": {"type": "string"}}
        }
    },
    depth=2,
    max_pages=5,
    same_domain_only=True,
)

crawl_id = crawl_response.get("id") or crawl_response.get("task_id")

# Poll for results
if crawl_id:
    for _ in range(10):
        time.sleep(5)
        result = client.get_crawl(crawl_id)
        if result.get("status") == "success":
            print("Crawl completed:", result["result"]["llm_result"])
            break

client.close()
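The fixed five-second polling loop above can be generalized with exponential backoff. A small stdlib-only sketch, where `fetch` stands in for `client.get_crawl`:

```python
import time


def poll_until_done(fetch, max_attempts=8, base_delay=1.0):
    """Call fetch() until it reports success, doubling the delay each attempt."""
    delay = base_delay
    for _ in range(max_attempts):
        result = fetch()
        if result.get("status") == "success":
            return result
        time.sleep(delay)
        delay *= 2
    raise TimeoutError("job did not finish within the polling budget")


# Stand-in for client.get_crawl(crawl_id): succeeds on the third call
calls = {"n": 0}

def fake_fetch():
    calls["n"] += 1
    return {"status": "success" if calls["n"] >= 3 else "processing"}

print(poll_until_done(fake_fetch, base_delay=0.01))  # {'status': 'success'}
```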

πŸ€– AgenticScraper

Perform automated browser actions on webpages using AI-powered agentic scraping with session management.

Example Usage:

from scrapegraph_py import Client
import os
from dotenv import load_dotenv

load_dotenv()

# Initialize the client
client = Client(api_key=os.getenv("SGAI_API_KEY"))

# Perform automated browser actions
response = client.agenticscraper(
    url="https://example.com",
    use_session=True,
    steps=[
        "Type email@gmail.com in the email input box",
        "Type password123 in the password input box",
        "Click on login"
    ],
    ai_extraction=False  # Set to True for AI extraction
)

print(f"Request ID: {response['request_id']}")
print(f"Status: {response.get('status')}")

# Get results
result = client.get_agenticscraper(response['request_id'])
print(f"Result: {result.get('result')}")

client.close()

πŸ“„ Scrape

Convert webpages into HTML format with optional JavaScript rendering and custom headers.

Example Usage:

from scrapegraph_py import Client
import os
from dotenv import load_dotenv

load_dotenv()

# Initialize the client
client = Client(api_key=os.getenv("SGAI_API_KEY"))

# Get HTML content from webpage
response = client.scrape(
    website_url="https://example.com",
    render_heavy_js=False,  # Set to True for JavaScript-heavy sites
)

print(f"Request ID: {response['request_id']}")
print(f"HTML length: {len(response.get('html', ''))} characters")

client.close()

⏰ Scheduled Jobs

Create, manage, and monitor scheduled scraping jobs with cron expressions and execution history.
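Cron expressions use five fields (minute, hour, day of month, month, day of week). A stdlib-only sketch of how a single field matches a value, independent of the SDK's scheduling API:

```python
def cron_field_matches(field: str, value: int) -> bool:
    """Match one cron field against a value: '*', '*/n' steps, or a comma list."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return value % int(field[2:]) == 0
    return value in {int(part) for part in field.split(",")}


# "0 */6 * * *" fires at minute 0 of hours 0, 6, 12, 18
print(cron_field_matches("0", 0))     # minute field -> True
print(cron_field_matches("*/6", 12))  # hour field -> True
```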

πŸ’³ Credits

Check your API credit balance and usage.
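A sketch of checking the balance. The `get_credits` call follows the SDK's pattern, but the `remaining_credits` field name is an assumption about the response shape, and the request only runs when `SGAI_API_KEY` is set:

```python
import os


def credits_low(info: dict, threshold: int = 100) -> bool:
    """True when the (assumed) remaining_credits field drops below threshold."""
    return info.get("remaining_credits", 0) < threshold


if os.getenv("SGAI_API_KEY"):
    from scrapegraph_py import Client

    client = Client(api_key=os.getenv("SGAI_API_KEY"))
    info = client.get_credits()
    print(f"Credits: {info}")
    if credits_low(info):
        print("Warning: balance is running low")
    client.close()

print(credits_low({"remaining_credits": 50}))   # True
print(credits_low({"remaining_credits": 500}))  # False
```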

πŸ’¬ Feedback

Send feedback and ratings for scraping requests to help improve the service.
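A sketch of submitting a rating. The `submit_feedback` method and its parameters follow the SDK's naming pattern but are assumptions here, as is the 1-5 rating scale; replace the request ID with one returned by a real request:

```python
import os


def valid_rating(rating: int) -> bool:
    """Ratings are assumed to be integers from 1 to 5."""
    return isinstance(rating, int) and 1 <= rating <= 5


if os.getenv("SGAI_API_KEY"):
    from scrapegraph_py import Client

    client = Client(api_key=os.getenv("SGAI_API_KEY"))
    client.submit_feedback(
        request_id="your-request-id",  # hypothetical placeholder ID
        rating=5,
        feedback_text="Accurate extraction",
    )
    client.close()

print(valid_rating(5))  # True
print(valid_rating(0))  # False
```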

🌟 Key Benefits

  • πŸ“ Natural Language Queries: No complex selectors or XPath needed
  • 🎯 Precise Extraction: AI understands context and structure
  • πŸ”„ Adaptive Processing: Works with both web content and direct HTML
  • πŸ“Š Schema Validation: Ensure data consistency with Pydantic
  • ⚑ Async Support: Handle multiple requests efficiently
  • πŸ” Source Attribution: Get reference URLs for search results

πŸ’‘ Use Cases

  • 🏒 Business Intelligence: Extract company information and contacts
  • πŸ“Š Market Research: Gather product data and pricing
  • πŸ“° Content Aggregation: Convert articles to structured formats
  • πŸ” Data Mining: Extract specific information from multiple sources
  • πŸ“± App Integration: Feed clean data into your applications
  • 🌐 Web Research: Perform AI-powered searches with structured results

πŸ“– Documentation

For detailed documentation and examples, visit:

πŸ’¬ Support & Feedback

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.


Made with ❀️ by ScrapeGraph AI
