Description
I tried the following curl command:
curl -H "X-Respond-With: markdown" 'http://127.0.0.1:3000/https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf'
Below are the logs from the Docker container:
2024-10-13 18:08:01 [Crawler] INFO: Crawl request received for URL: /https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf
2024-10-13 18:08:01 Crawl method called with request: /https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf
2024-10-13 18:08:01 req.headers: {"host":"127.0.0.1:3000","user-agent":"curl/8.7.1","accept":"*/*","x-respond-with":"markdown"}
2024-10-13 18:08:01 Request headers: {
2024-10-13 18:08:01 host: '127.0.0.1:3000',
2024-10-13 18:08:01 'user-agent': 'curl/8.7.1',
2024-10-13 18:08:01 accept: '*/*',
2024-10-13 18:08:01 'x-respond-with': 'markdown'
2024-10-13 18:08:01 }
2024-10-13 18:08:01 Request headers: {
2024-10-13 18:08:01 host: '127.0.0.1:3000',
2024-10-13 18:08:01 'user-agent': 'curl/8.7.1',
2024-10-13 18:08:01 accept: '*/*',
2024-10-13 18:08:01 'x-respond-with': 'markdown'
2024-10-13 18:08:01 }
2024-10-13 18:08:01 Crawler options: CrawlerOptionsHeaderOnly {
2024-10-13 18:08:01 respondWith: 'markdown',
2024-10-13 18:08:01 withGeneratedAlt: false,
2024-10-13 18:08:01 withLinksSummary: false,
2024-10-13 18:08:01 withImagesSummary: false,
2024-10-13 18:08:01 noCache: false,
2024-10-13 18:08:01 keepImgDataUrl: false,
2024-10-13 18:08:01 withIframe: false,
2024-10-13 18:08:01 removeSelector: undefined,
2024-10-13 18:08:01 targetSelector: undefined,
2024-10-13 18:08:01 waitForSelector: undefined,
2024-10-13 18:08:01 userAgent: undefined,
2024-10-13 18:08:01 proxyUrl: undefined
2024-10-13 18:08:01 }
2024-10-13 18:08:01 Added to circuit breaker hosts: 127.0.0.1
2024-10-13 18:08:01 Cookies: []
2024-10-13 18:08:01 Configured crawl options: {
2024-10-13 18:08:01 proxyUrl: undefined,
2024-10-13 18:08:01 cookies: [],
2024-10-13 18:08:01 favorScreenshot: false,
2024-10-13 18:08:01 removeSelector: undefined,
2024-10-13 18:08:01 targetSelector: undefined,
2024-10-13 18:08:01 waitForSelector: undefined,
2024-10-13 18:08:01 overrideUserAgent: undefined,
2024-10-13 18:08:01 timeoutMs: undefined,
2024-10-13 18:08:01 withIframe: false
2024-10-13 18:08:01 }
2024-10-13 18:08:01 [Crawler] INFO: Starting scrap for URL: https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf
2024-10-13 18:08:01 Starting scrap for URL: https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf
2024-10-13 18:08:01 Crawl options: {
2024-10-13 18:08:01 proxyUrl: undefined,
2024-10-13 18:08:01 cookies: [],
2024-10-13 18:08:01 favorScreenshot: false,
2024-10-13 18:08:01 removeSelector: undefined,
2024-10-13 18:08:01 targetSelector: undefined,
2024-10-13 18:08:01 waitForSelector: undefined,
2024-10-13 18:08:01 overrideUserAgent: undefined,
2024-10-13 18:08:01 timeoutMs: undefined,
2024-10-13 18:08:01 withIframe: false
2024-10-13 18:08:01 }
2024-10-13 18:08:01 Crawler options: CrawlerOptionsHeaderOnly {
2024-10-13 18:08:01 respondWith: 'markdown',
2024-10-13 18:08:01 withGeneratedAlt: false,
2024-10-13 18:08:01 withLinksSummary: false,
2024-10-13 18:08:01 withImagesSummary: false,
2024-10-13 18:08:01 noCache: false,
2024-10-13 18:08:01 keepImgDataUrl: false,
2024-10-13 18:08:01 withIframe: false,
2024-10-13 18:08:01 removeSelector: undefined,
2024-10-13 18:08:01 targetSelector: undefined,
2024-10-13 18:08:01 waitForSelector: undefined,
2024-10-13 18:08:01 userAgent: undefined,
2024-10-13 18:08:01 proxyUrl: undefined
2024-10-13 18:08:01 }
2024-10-13 18:08:01 Using default scraping method
2024-10-13 18:08:01 Scraping options: {
2024-10-13 18:08:01 proxyUrl: undefined,
2024-10-13 18:08:01 cookies: [],
2024-10-13 18:08:01 favorScreenshot: false,
2024-10-13 18:08:01 removeSelector: undefined,
2024-10-13 18:08:01 targetSelector: undefined,
2024-10-13 18:08:01 waitForSelector: undefined,
2024-10-13 18:08:01 overrideUserAgent: undefined,
2024-10-13 18:08:01 timeoutMs: undefined,
2024-10-13 18:08:01 withIframe: false
2024-10-13 18:08:01 }
2024-10-13 18:08:01 [CHANGE_LOGGER_NAME] INFO: Page 40: Scraping https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf {
2024-10-13 18:08:01 url: 'https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf'
2024-10-13 18:08:01 }
2024-10-13 18:08:01 [CHANGE_LOGGER_NAME] INFO: Page 40: Attempting to set cookies: []
2024-10-13 18:08:01 Formatting snapshot {
2024-10-13 18:08:01 mode: 'markdown',
2024-10-13 18:08:01 url: 'https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf'
2024-10-13 18:08:01 }
2024-10-13 18:08:01 Processing HTML content
2024-10-13 18:08:01 Getting Turndown service {
2024-10-13 18:08:01 url: URL {
2024-10-13 18:08:01 href: 'https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf',
2024-10-13 18:08:01 origin: 'https://assets.airtel.in',
2024-10-13 18:08:01 protocol: 'https:',
2024-10-13 18:08:01 username: '',
2024-10-13 18:08:01 password: '',
2024-10-13 18:08:01 host: 'assets.airtel.in',
2024-10-13 18:08:01 hostname: 'assets.airtel.in',
2024-10-13 18:08:01 port: '',
2024-10-13 18:08:01 pathname: '/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf',
2024-10-13 18:08:01 search: '',
2024-10-13 18:08:01 searchParams: URLSearchParams {},
2024-10-13 18:08:01 hash: ''
2024-10-13 18:08:01 },
2024-10-13 18:08:01 imgDataUrlToObjectUrl: true
2024-10-13 18:08:01 }
2024-10-13 18:08:01 Adding Turndown rules
2024-10-13 18:08:01 Adding data-url-to-pseudo-object-url rule
2024-10-13 18:08:01 Turndown service configured
2024-10-13 18:08:01 Skipping parsed content processing
The logs stop at this point and I did not get any response. Am I missing something?
Also, if possible, please add Helm charts showing how to deploy it on a Kubernetes cluster; that would be really helpful.
What should the crawl request look like? An example in the README would help.
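For reference, here is a minimal sketch of the same request with a few diagnostic flags added, so that an empty reply can be told apart from a hang. It only restates the request tried above and is not confirmed as the intended usage. The helper name `fetch_markdown` and the 120-second limit are assumptions chosen for illustration; the base address, target URL, and `X-Respond-With` header are the ones from the command above.

```shell
#!/bin/sh
# Base address of the locally running container (as used in the curl command above).
BASE_URL='http://127.0.0.1:3000'
# Target document to crawl (as used in the curl command above).
TARGET_URL='https://assets.airtel.in/teams/simplycms/ADTECH/docs/Integrated_Report_and_Annual_Financial_Statements.pdf'
# The request appends the full target URL to the base address as the path.
REQUEST_URL="${BASE_URL}/${TARGET_URL}"

# fetch_markdown is a hypothetical helper name, not part of the project.
fetch_markdown() {
  # -sS: hide the progress meter but still print errors
  # -i: include the response status line and headers in the output
  # --max-time 120: give up after 120 seconds (arbitrary value) instead of waiting forever
  curl -sS -i --max-time 120 \
    -H 'X-Respond-With: markdown' \
    "$REQUEST_URL"
}
```

Running `fetch_markdown; echo "curl exit code: $?"` prints whatever the server returns, followed by curl's exit code. Exit code 28 means the time limit was reached, and 52 means the server closed the connection without sending anything.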