Does it support PDF and how do we crawl a website? #10

@rostwal95

I have tried the below curl command:

```
curl -H "X-Respond-With: markdown" 'http://127.0.0.1:3000/https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf'
```
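
For reference, a quick way to confirm the PDF itself is reachable and actually served with a PDF content type (plain curl, nothing specific to this project) would be:

```
# Check that the asset URL resolves and is served as a PDF
curl -sI 'https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf' \
  | grep -iE 'HTTP/|content-type|content-length'
```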

Below are the logs from the docker container:

```
2024-10-13 18:08:01 [Crawler] INFO: Crawl request received for URL: /https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf
2024-10-13 18:08:01 Crawl method called with request: /https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf
2024-10-13 18:08:01 req.headers: {"host":"127.0.0.1:3000","user-agent":"curl/8.7.1","accept":"*/*","x-respond-with":"markdown"}
2024-10-13 18:08:01 Request headers: {
2024-10-13 18:08:01 host: '127.0.0.1:3000',
2024-10-13 18:08:01 'user-agent': 'curl/8.7.1',
2024-10-13 18:08:01 accept: '*/*',
2024-10-13 18:08:01 'x-respond-with': 'markdown'
2024-10-13 18:08:01 }
2024-10-13 18:08:01 Request headers: {
2024-10-13 18:08:01 host: '127.0.0.1:3000',
2024-10-13 18:08:01 'user-agent': 'curl/8.7.1',
2024-10-13 18:08:01 accept: '*/*',
2024-10-13 18:08:01 'x-respond-with': 'markdown'
2024-10-13 18:08:01 }
2024-10-13 18:08:01 Crawler options: CrawlerOptionsHeaderOnly {
2024-10-13 18:08:01 respondWith: 'markdown',
2024-10-13 18:08:01 withGeneratedAlt: false,
2024-10-13 18:08:01 withLinksSummary: false,
2024-10-13 18:08:01 withImagesSummary: false,
2024-10-13 18:08:01 noCache: false,
2024-10-13 18:08:01 keepImgDataUrl: false,
2024-10-13 18:08:01 withIframe: false,
2024-10-13 18:08:01 removeSelector: undefined,
2024-10-13 18:08:01 targetSelector: undefined,
2024-10-13 18:08:01 waitForSelector: undefined,
2024-10-13 18:08:01 userAgent: undefined,
2024-10-13 18:08:01 proxyUrl: undefined
2024-10-13 18:08:01 }
2024-10-13 18:08:01 Added to circuit breaker hosts: 127.0.0.1
2024-10-13 18:08:01 Cookies: []
2024-10-13 18:08:01 Configured crawl options: {
2024-10-13 18:08:01 proxyUrl: undefined,
2024-10-13 18:08:01 cookies: [],
2024-10-13 18:08:01 favorScreenshot: false,
2024-10-13 18:08:01 removeSelector: undefined,
2024-10-13 18:08:01 targetSelector: undefined,
2024-10-13 18:08:01 waitForSelector: undefined,
2024-10-13 18:08:01 overrideUserAgent: undefined,
2024-10-13 18:08:01 timeoutMs: undefined,
2024-10-13 18:08:01 withIframe: false
2024-10-13 18:08:01 }
2024-10-13 18:08:01 [Crawler] INFO: Starting scrap for URL: https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf
2024-10-13 18:08:01 Starting scrap for URL: https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf
2024-10-13 18:08:01 Crawl options: {
2024-10-13 18:08:01 proxyUrl: undefined,
2024-10-13 18:08:01 cookies: [],
2024-10-13 18:08:01 favorScreenshot: false,
2024-10-13 18:08:01 removeSelector: undefined,
2024-10-13 18:08:01 targetSelector: undefined,
2024-10-13 18:08:01 waitForSelector: undefined,
2024-10-13 18:08:01 overrideUserAgent: undefined,
2024-10-13 18:08:01 timeoutMs: undefined,
2024-10-13 18:08:01 withIframe: false
2024-10-13 18:08:01 }
2024-10-13 18:08:01 Crawler options: CrawlerOptionsHeaderOnly {
2024-10-13 18:08:01 respondWith: 'markdown',
2024-10-13 18:08:01 withGeneratedAlt: false,
2024-10-13 18:08:01 withLinksSummary: false,
2024-10-13 18:08:01 withImagesSummary: false,
2024-10-13 18:08:01 noCache: false,
2024-10-13 18:08:01 keepImgDataUrl: false,
2024-10-13 18:08:01 withIframe: false,
2024-10-13 18:08:01 removeSelector: undefined,
2024-10-13 18:08:01 targetSelector: undefined,
2024-10-13 18:08:01 waitForSelector: undefined,
2024-10-13 18:08:01 userAgent: undefined,
2024-10-13 18:08:01 proxyUrl: undefined
2024-10-13 18:08:01 }
2024-10-13 18:08:01 Using default scraping method
2024-10-13 18:08:01 Scraping options: {
2024-10-13 18:08:01 proxyUrl: undefined,
2024-10-13 18:08:01 cookies: [],
2024-10-13 18:08:01 favorScreenshot: false,
2024-10-13 18:08:01 removeSelector: undefined,
2024-10-13 18:08:01 targetSelector: undefined,
2024-10-13 18:08:01 waitForSelector: undefined,
2024-10-13 18:08:01 overrideUserAgent: undefined,
2024-10-13 18:08:01 timeoutMs: undefined,
2024-10-13 18:08:01 withIframe: false
2024-10-13 18:08:01 }
2024-10-13 18:08:01 [CHANGE_LOGGER_NAME] INFO: Page 40: Scraping https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf {
2024-10-13 18:08:01 url: 'https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf'
2024-10-13 18:08:01 }
2024-10-13 18:08:01 [CHANGE_LOGGER_NAME] INFO: Page 40: Attempting to set cookies: []
2024-10-13 18:08:01 Formatting snapshot {
2024-10-13 18:08:01 mode: 'markdown',
2024-10-13 18:08:01 url: 'https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf'
2024-10-13 18:08:01 }
2024-10-13 18:08:01 Processing HTML content
2024-10-13 18:08:01 Getting Turndown service {
2024-10-13 18:08:01 url: URL {
2024-10-13 18:08:01 href: 'https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf',
2024-10-13 18:08:01 origin: 'https://assets.airtel.in',
2024-10-13 18:08:01 protocol: 'https:',
2024-10-13 18:08:01 username: '',
2024-10-13 18:08:01 password: '',
2024-10-13 18:08:01 host: 'assets.airtel.in',
2024-10-13 18:08:01 hostname: 'assets.airtel.in',
2024-10-13 18:08:01 port: '',
2024-10-13 18:08:01 pathname: '/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf',
2024-10-13 18:08:01 search: '',
2024-10-13 18:08:01 searchParams: URLSearchParams {},
2024-10-13 18:08:01 hash: ''
2024-10-13 18:08:01 },
2024-10-13 18:08:01 imgDataUrlToObjectUrl: true
2024-10-13 18:08:01 }
2024-10-13 18:08:01 Adding Turndown rules
2024-10-13 18:08:01 Adding data-url-to-pseudo-object-url rule
2024-10-13 18:08:01 Turndown service configured
2024-10-13 18:08:01 Skipping parsed content processing
```

I did not get any response. Am I missing something?
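
In case it helps narrow things down, re-running the same request with verbose output and a generous timeout (plain curl flags) should show whether the connection stalls or the server just takes a long time on a large PDF:

```
# Verbose output plus a long timeout, since the PDF is large;
# writes whatever comes back to output.md
curl -v --max-time 300 \
  -H "X-Respond-With: markdown" \
  -o output.md \
  'http://127.0.0.1:3000/https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf'
```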

Also, if possible, please add Helm charts showing how to deploy this on a Kubernetes cluster; that would be really helpful.
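
Until a chart is available, I'm assuming a bare-bones test deployment along these lines would do (the image name/registry and container port 3000 are my assumptions based on the local setup above, not from any official docs):

```
# Rough sketch of a chart-less test deployment; image name/tag are assumed
kubectl create deployment reader --image=<your-registry>/reader:latest
kubectl expose deployment reader --port=3000 --target-port=3000
kubectl port-forward deployment/reader 3000:3000
```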

And what should the crawl request look like? It would help if you could add an example to the README.
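
For example, based on the URL-prefix pattern I used above, I assume a request for an ordinary web page would look like this (the target URL is just a placeholder):

```
# Prefix the target URL with the local endpoint and choose the response format
curl -H "X-Respond-With: markdown" 'http://127.0.0.1:3000/https://example.com'
```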
