
[Self-Host] call to playwright is failing #902

Open
rostwal95 opened this issue Nov 15, 2024 · 21 comments

@rostwal95

Describe the Issue
The call to Playwright fails when trying to scrape a page with the Playwright engine.

To Reproduce
Steps to reproduce the issue:

  1. Configure the environment or settings with '...'
  2. Run the command '...'
  3. Observe the error or unexpected output at '...'
  4. Log output/error message

Expected Behavior
The call to Playwright should succeed, and dynamic JavaScript should be rendered and cleaned up.

Screenshots
If applicable, add screenshots or copies of the command line output to help explain the self-hosting issue.

Environment (please complete the following information):

  • OS: [e.g. macOS, Linux, Windows]
  • Firecrawl Version: [e.g. 1.2.3]
  • Node.js Version: [e.g. 14.x]
  • Docker Version (if applicable): [e.g. 20.10.14]
  • Database Type and Version: [e.g. PostgreSQL 13.4]

Logs
worker-1 | 2024-11-15 05:13:48 debug [ScrapeURL:]: Engine docx meets feature priority threshold
worker-1 | 2024-11-15 05:13:48 info [ScrapeURL:]: Scraping via playwright...
worker-1 | 2024-11-15 05:13:48 debug [ScrapeURL:scrapeURLWithPlaywright]: Sending request...
worker-1 | 2024-11-15 05:13:48 debug [ScrapeURL:scrapeURLWithPlaywright]: Request sent failure status
worker-1 | 2024-11-15 05:13:48 info [ScrapeURL:]: An unexpected error happened while scraping with playwright.
worker-1 | 2024-11-15 05:13:48 info [ScrapeURL:]: Scraping via fetch...

here are the logs

Configuration
Provide relevant parts of your configuration files (with sensitive information redacted).

Additional Context
Add any other context about the self-hosting issue here, such as specific infrastructure details, network setup, or any modifications made to the original Firecrawl setup.

@mogery
Member

mogery commented Nov 15, 2024

Can you share the logs of the playwright microservice as well?

@mogery mogery self-assigned this Nov 15, 2024
@mkaskov

mkaskov commented Nov 15, 2024

Same problem here, with apps/playwright-service-ts:

playwright-service-1 | SyntaxError: Unexpected token " in JSON at position 0
playwright-service-1 | at JSON.parse ()
playwright-service-1 | at createStrictSyntaxError (/usr/src/app/node_modules/body-parser/lib/types/json.js:169:10)
playwright-service-1 | at parse (/usr/src/app/node_modules/body-parser/lib/types/json.js:86:15)
playwright-service-1 | at /usr/src/app/node_modules/body-parser/lib/read.js:128:18
playwright-service-1 | at AsyncResource.runInAsyncScope (node:async_hooks:203:9)
playwright-service-1 | at invokeCallback (/usr/src/app/node_modules/raw-body/index.js:238:16)
playwright-service-1 | at done (/usr/src/app/node_modules/raw-body/index.js:227:7)
playwright-service-1 | at IncomingMessage.onEnd (/usr/src/app/node_modules/raw-body/index.js:287:7)
playwright-service-1 | at IncomingMessage.emit (node:events:517:28)
playwright-service-1 | at endReadableNT (node:internal/streams/readable:1400:12)
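The `Unexpected token " in JSON at position 0` above is body-parser's strict JSON parser rejecting a body whose first non-whitespace character is not `{` or `[` — i.e. the worker was sending something that is not a JSON object (for example a double-stringified payload, which would start with exactly the `"` token in the stack trace). A minimal sketch of the failure mode (the parse logic is a simplification of body-parser's strict mode; the `wait_after_load` field name is an assumption about the service's expected body):

```typescript
// Simplified model of body-parser's strict JSON handling: any body whose
// first non-whitespace character is not "{" or "[" is rejected with the
// same "Unexpected token ... in JSON at position 0" seen in the logs.
function parseLikeBodyParser(raw: string): unknown {
  const first = raw.trim()[0];
  if (first !== "{" && first !== "[") {
    throw new SyntaxError(`Unexpected token ${first} in JSON at position 0`);
  }
  return JSON.parse(raw);
}

// What the playwright service expects (field names are assumptions):
const goodBody = JSON.stringify({ url: "https://example.com", wait_after_load: 0 });

// A double-stringified payload starts with a quote -- exactly the token
// reported in the stack trace above -- and strict mode rejects it.
const badBody = JSON.stringify(goodBody);
```

So the client-side fix is to `JSON.stringify` the body exactly once and send it with `Content-Type: application/json`, which matches mogery's note below that the request to the microservice was being sent wrong.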

@mogery
Member

mogery commented Nov 15, 2024

I just made a change; I think the way we sent the request to the microservice was wrong. Can you rebuild Firecrawl (no need to rebuild playwright-service) and try again?

@rostwal95
Author

I am getting errors while building the Docker container as well:

=> ERROR [playwright-service 2/6] RUN apt-get update && apt-get install -y --no-install-recommends gcc libstdc++6 0.9s

[playwright-service 2/6] RUN apt-get update && apt-get install -y --no-install-recommends gcc libstdc++6:
0.539 Get:1 http://deb.debian.org/debian bookworm InRelease [151 kB]
0.645 Err:1 http://deb.debian.org/debian bookworm InRelease
0.645 At least one invalid signature was encountered.
0.648 Get:2 http://deb.debian.org/debian bookworm-updates InRelease [55.4 kB]
0.677 Err:2 http://deb.debian.org/debian bookworm-updates InRelease
0.677 At least one invalid signature was encountered.
0.693 Get:3 http://deb.debian.org/debian-security bookworm-security InRelease [48.0 kB]
0.717 Err:3 http://deb.debian.org/debian-security bookworm-security InRelease
0.717 At least one invalid signature was encountered.
0.722 Reading package lists...
0.728 W: GPG error: http://deb.debian.org/debian bookworm InRelease: At least one invalid signature was encountered.
0.728 E: The repository 'http://deb.debian.org/debian bookworm InRelease' is not signed.
0.728 W: GPG error: http://deb.debian.org/debian bookworm-updates InRelease: At least one invalid signature was encountered.
0.728 E: The repository 'http://deb.debian.org/debian bookworm-updates InRelease' is not signed.
0.728 W: GPG error: http://deb.debian.org/debian-security bookworm-security InRelease: At least one invalid signature was encountered.
0.728 E: The repository 'http://deb.debian.org/debian-security bookworm-security InRelease' is not signed.


failed to solve: process "/bin/sh -c apt-get update && apt-get install -y --no-install-recommends gcc libstdc++6" did not complete successfully: exit code: 100

@rostwal95
Author

rostwal95 commented Nov 15, 2024

I still see the issue. I'm also not sure why the log level isn't marked as error:

worker-1 | 2024-11-15 16:23:58 info [:]: 🐂 Worker taking job b2c3e207-55ca-4abb-8be1-57a0b1b88cd2
worker-1 | 2024-11-15 16:23:58 info [ScrapeURL:]: Scraping URL "https://www.britishairways.com/travel/home/public/en_us/"...
worker-1 | 2024-11-15 16:23:58 debug [ScrapeURL:]: Engine scrapingbee meets feature priority threshold
worker-1 | 2024-11-15 16:23:58 debug [ScrapeURL:]: Engine scrapingbeeLoad meets feature priority threshold
worker-1 | 2024-11-15 16:23:58 debug [ScrapeURL:]: Engine playwright meets feature priority threshold
worker-1 | 2024-11-15 16:23:58 debug [ScrapeURL:]: Engine fetch meets feature priority threshold
worker-1 | 2024-11-15 16:23:58 debug [ScrapeURL:]: Engine pdf meets feature priority threshold
worker-1 | 2024-11-15 16:23:58 debug [ScrapeURL:]: Engine docx meets feature priority threshold
worker-1 | 2024-11-15 16:23:58 info [ScrapeURL:]: Scraping via scrapingbee...
worker-1 | 2024-11-15 16:23:59 error [ScrapeURL:]: ScrapingBee threw an error {"module":"ScrapeURL","scrapeId":"b2c3e207-55ca-4abb-8be1-57a0b1b88cd2","method":"","engine":"scrapingbee","body":{"message":"Invalid api key: # use if you'd like to use as a fallback scraper"}}
worker-1 | 2024-11-15 16:23:59 info [ScrapeURL:]: Engine scrapingbee could not scrape the page.
worker-1 | 2024-11-15 16:23:59 info [ScrapeURL:]: Scraping via scrapingbeeLoad...
worker-1 | 2024-11-15 16:23:59 error [ScrapeURL:]: ScrapingBee threw an error {"module":"ScrapeURL","scrapeId":"b2c3e207-55ca-4abb-8be1-57a0b1b88cd2","method":"","engine":"scrapingbeeLoad","body":{"message":"Invalid api key: # use if you'd like to use as a fallback scraper"}}
worker-1 | 2024-11-15 16:23:59 info [ScrapeURL:]: Engine scrapingbeeLoad could not scrape the page.
worker-1 | 2024-11-15 16:23:59 info [ScrapeURL:]: Scraping via playwright...
worker-1 | 2024-11-15 16:23:59 debug [ScrapeURL:scrapeURLWithPlaywright]: Sending request...
worker-1 | 2024-11-15 16:23:59 debug [ScrapeURL:scrapeURLWithPlaywright]: Request failed
worker-1 | 2024-11-15 16:23:59 info [ScrapeURL:]: An unexpected error happened while scraping with playwright.

worker-1 | 2024-11-15 16:23:59 info [ScrapeURL:]: Scraping via fetch...
worker-1 | 2024-11-15 16:24:01 info [ScrapeURL:]: Scrape via fetch deemed successful.
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Executing transformer deriveHTMLFromRawHTML...
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Finished executing transformer deriveHTMLFromRawHTML (7ms)
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Executing transformer deriveMarkdownFromHTML...
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Finished executing transformer deriveMarkdownFromHTML (1ms)
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Executing transformer deriveLinksFromHTML...
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Finished executing transformer deriveLinksFromHTML (0ms)
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Executing transformer deriveMetadataFromRawHTML...
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Finished executing transformer deriveMetadataFromRawHTML (4ms)
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Executing transformer uploadScreenshot...
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Finished executing transformer uploadScreenshot (0ms)
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Executing transformer performLLMExtract...
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Finished executing transformer performLLMExtract (0ms)
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Executing transformer coerceFieldsToFormats...
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Finished executing transformer coerceFieldsToFormats (0ms)
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Executing transformer removeBase64Images...
worker-1 | 2024-11-15 16:24:01 debug [ScrapeURL:]: Finished executing transformer removeBase64Images (0ms)
worker-1 | 2024-11-15 16:24:01 info [:]: 🐂 Job done b2c3e207-55ca-4abb-8be1-57a0b1b88cd2

The response has empty markdown:

{
  "success": true,
  "data": {
    "markdown": "",
    "metadata": {
      "title": "British Airways | Book Flights, Holidays, City Breaks & Check In Online",
      "description": "Save on worldwide flights and holidays when you book directly with British Airways. Browse our guides, find great deals, manage your booking and check in online.",
      "language": "en",
      "robots": "all",
      "ogLocaleAlternate": [],
      "theme-color": "#ffffff",
      "viewport": "width=device-width, initial-scale=1",
      "sourceURL": "https://www.britishairways.com/travel/home/public/en_us/",
      "url": "https://www.britishairways.com/travel/home/public/en_us/",
      "statusCode": 200
    }
  }
}

@mkaskov

mkaskov commented Nov 20, 2024

Another error. After it happens, Firecrawl no longer works correctly:

worker-1 | 2024-11-20 06:40:56 info [ScrapeURL:]: An unexpected error happened while scraping with playwright.
worker-1 | 2024-11-20 06:40:56 info [ScrapeURL:]: Scraping via fetch...
worker-1 | 2024-11-20 06:40:57 info [ScrapeURL:]: Scrape via fetch deemed successful.
worker-1 | 2024-11-20 06:40:57 info [:]: 🐂 Job done 79431bc8-736d-4379-bf0d-ddae76e0dabe
api-1 | 2024-11-20 06:40:57 warn [:]: You're bypassing authentication {}
playwright-service-1 | ✅ Scrape successful!
worker-1 | 2024-11-20 06:40:57 info [ScrapeURL:]: An unexpected error happened while scraping with playwright.
worker-1 | 2024-11-20 06:40:57 info [ScrapeURL:]: Scraping via fetch...
playwright-service-1 | ✅ Scrape successful!
worker-1 | 2024-11-20 06:40:57 info [ScrapeURL:]: An unexpected error happened while scraping with playwright.
worker-1 | 2024-11-20 06:40:57 info [ScrapeURL:]: Scraping via fetch...
playwright-service-1 | ✅ Scrape successful!
worker-1 | 2024-11-20 06:40:57 info [ScrapeURL:]: Scrape via fetch deemed successful.
worker-1 | 2024-11-20 06:40:58 info [ScrapeURL:]: An unexpected error happened while scraping with playwright.
worker-1 | 2024-11-20 06:40:58 info [ScrapeURL:]: Scraping via fetch...
worker-1 | 2024-11-20 06:40:58 info [ScrapeURL:]: Scrape via fetch deemed successful.
worker-1 | 2024-11-20 06:40:58 info [ScrapeURL:]: Scrape via fetch deemed successful.
worker-1 | 2024-11-20 06:40:58 info [:]: 🐂 Job done 27e3c51f-1b4a-45a9-8b4f-abe61f67ac8a
worker-1 | 2024-11-20 06:40:58 info [:]: 🐂 Job done e2ec67bb-740b-40fa-803d-4918ced6006c
worker-1 | 2024-11-20 06:40:58 info [:]: 🐂 Job done a252a1c1-aa4c-4f0c-960b-c379292cb997
worker-1 | 2024-11-20 06:40:59 info [:]: 🐂 Worker taking job 0ad857e1-70f6-4c2d-9255-2c890f207c5a
worker-1 | 2024-11-20 06:40:59 error [:]: 🐂 Job errored 0ad857e1-70f6-4c2d-9255-2c890f207c5a - TypeError: Cannot read properties of undefined (reading 'timeout') {}
worker-1 | 2024-11-20 06:40:59 error [:]: undefined {}
worker-1 | 2024-11-20 06:40:59 error [:]: TypeError: Cannot read properties of undefined (reading 'timeout')
worker-1 | at processJob (/app/dist/src/services/queue-worker.js:249:40)
worker-1 | at processJobInternal (/app/dist/src/services/queue-worker.js:65:30)
worker-1 | at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {}
worker-1 | /app/dist/src/main/runWebScraper.js:18
worker-1 | formats: job.data.scrapeOptions.formats.concat(["rawHtml"]),
worker-1 | ^
worker-1 |
worker-1 | TypeError: Cannot read properties of undefined (reading 'formats')
worker-1 | at startWebScraperPipeline (/app/dist/src/main/runWebScraper.js:18:49)
worker-1 | at processJob (/app/dist/src/services/queue-worker.js:245:57)
worker-1 | at processJobInternal (/app/dist/src/services/queue-worker.js:65:30)
worker-1 | at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
worker-1 |
worker-1 | Node.js v20.18.0
worker-1 exited with code 1
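The `Cannot read properties of undefined (reading 'timeout')` crash above happens because `processJob` dereferences `job.data.scrapeOptions` without checking it exists, and `runWebScraper.js:18` then does the same with `.formats`, taking down the worker process. A hypothetical defensive sketch — names mirror the stack trace, and the actual fix may well differ:

```typescript
// Guarding the dereference from runWebScraper.js:18 with optional
// chaining / defaults, so a job enqueued without scrapeOptions degrades
// gracefully instead of killing the worker with a TypeError.
interface ScrapeOptions { formats?: string[]; timeout?: number }
interface JobData { scrapeOptions?: ScrapeOptions }

function resolveFormats(data: JobData): string[] {
  // Fall back to an empty options object instead of crashing on undefined.
  const opts = data.scrapeOptions ?? {};
  return (opts.formats ?? []).concat(["rawHtml"]);
}
```

Whether the right fix is defaulting like this or rejecting malformed job data at enqueue time is a design call for the maintainers; the sketch only shows where the undefined access occurs.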

@lauridskern

same issue for me

@fatwang2

same issue

@shingoxray

same issue +1

@zhucan

zhucan commented Dec 3, 2024

same issue +1

@Hanxiao-Adam-Qi

Same issue ("info [ScrapeURL:]: An unexpected error happened while scraping with playwright."), with both the original playwright service and playwright-service-ts.

@riddlegit

riddlegit commented Dec 9, 2024

Cannot read properties of undefined (reading 'timeout')

I guess this kind of error is caused by missing job-config properties; maybe try adding a "timeout" property to the JSON job data or scrapeOptions.
https://docs.firecrawl.dev/v1-welcome
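As a sketch of that suggestion (field names are assumptions based on the v1 docs linked above, not verified against this deployment):

```typescript
// Hypothetical scrape request with an explicit "timeout", so that
// job.data.scrapeOptions.timeout is defined when the worker reads it.
const scrapeRequest = {
  url: "https://www.britishairways.com/travel/home/public/en_us/",
  formats: ["markdown"],
  timeout: 30000, // ms; the property the worker fails to read when missing
};

// Serialized once for a POST body with Content-Type: application/json.
const requestBody = JSON.stringify(scrapeRequest);
```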

@lune-sta

same issue

@mogery
Member

mogery commented Dec 15, 2024

Hey y'all! This should be fixed by #977 which we just merged. Can you re-test?

@rostwal95
Author

playwright-service-1 | [2024-12-15 15:17:25 +0000] [10] [INFO] Running on http://[::]:3000 (CTRL + C to quit)
api-1 | 2024-12-15 15:18:09 warn [:]: You're bypassing authentication {}
api-1 | 2024-12-15 15:18:09 warn [:]: You're bypassing authentication {}
worker-1 | 2024-12-15 15:18:09 info [queue-worker:processJob]: 🐂 Worker taking job b4219ad0-71ee-4551-9a4a-923eaa71d301
worker-1 | 2024-12-15 15:18:09 info [ScrapeURL:]: Scraping URL "https://www.britishairways.com/travel/home/public/en_us/"...
worker-1 | 2024-12-15 15:18:09 debug [ScrapeURL:]: Engine scrapingbee meets feature priority threshold
worker-1 | 2024-12-15 15:18:09 debug [ScrapeURL:]: Engine scrapingbeeLoad meets feature priority threshold
worker-1 | 2024-12-15 15:18:09 debug [ScrapeURL:]: Engine playwright meets feature priority threshold
worker-1 | 2024-12-15 15:18:09 debug [ScrapeURL:]: Engine fetch meets feature priority threshold
worker-1 | 2024-12-15 15:18:09 debug [ScrapeURL:]: Engine pdf meets feature priority threshold
worker-1 | 2024-12-15 15:18:09 debug [ScrapeURL:]: Engine docx meets feature priority threshold
worker-1 | 2024-12-15 15:18:09 info [ScrapeURL:]: Scraping via scrapingbee...
worker-1 | 2024-12-15 15:18:10 error [ScrapeURL:]: ScrapingBee threw an error {"module":"ScrapeURL","scrapeId":"b4219ad0-71ee-4551-9a4a-923eaa71d301","scrapeURL":"https://www.britishairways.com/travel/home/public/en_us/","method":"","engine":"scrapingbee","body":{"message":"Invalid api key: # use if you'd like to use as a fallback scraper"}}
worker-1 | 2024-12-15 15:18:10 info [ScrapeURL:]: Engine scrapingbee could not scrape the page.
worker-1 | 2024-12-15 15:18:10 info [ScrapeURL:]: Scraping via scrapingbeeLoad...
worker-1 | 2024-12-15 15:18:10 error [ScrapeURL:]: ScrapingBee threw an error {"module":"ScrapeURL","scrapeId":"b4219ad0-71ee-4551-9a4a-923eaa71d301","scrapeURL":"https://www.britishairways.com/travel/home/public/en_us/","method":"","engine":"scrapingbeeLoad","body":{"message":"Invalid api key: # use if you'd like to use as a fallback scraper"}}
worker-1 | 2024-12-15 15:18:10 info [ScrapeURL:]: Engine scrapingbeeLoad could not scrape the page.
worker-1 | 2024-12-15 15:18:10 info [ScrapeURL:]: Scraping via playwright...
worker-1 | 2024-12-15 15:18:10 debug [ScrapeURL:scrapeURLWithPlaywright]: Request failed
worker-1 | 2024-12-15 15:18:10 info [ScrapeURL:]: An unexpected error happened while scraping with playwright.
worker-1 | 2024-12-15 15:18:10 info [ScrapeURL:]: Scraping via fetch...
worker-1 | 2024-12-15 15:18:11 info [ScrapeURL:]: Scrape via fetch deemed successful.
worker-1 | 2024-12-15 15:18:11 debug [ScrapeURL:]: Executed transformers.
worker-1 | 2024-12-15 15:18:11 info [queue-worker:processJob]: 🐂 Job done b4219ad0-71ee-4551-9a4a-923eaa71d301
worker-1 | 2024-12-15 15:18:11 debug [queue-worker:processJobInternal]: Job succeeded -- putting result in Redis
api-1 | 2024-12-15 15:18:11 warn [:]: You're bypassing authentication {}

Pulled the latest code; still the same.

@rostwal95
Author

api-1 | 2024-12-15 18:10:43 warn [:]: You're bypassing authentication {}
api-1 | 2024-12-15 18:10:43 warn [:]: You're bypassing authentication {}
api-1 | 2024-12-15 18:10:43 debug [api/v1:crawlController]: Crawl 130c2cee-6bf8-417b-a25f-7cfcb7152680 starting
api-1 | 2024-12-15 18:10:43 debug [api/v1:crawlController]: Determined limit: 10000
api-1 | 2024-12-15 18:10:48 debug [api/v1:crawlController]: Failed to get robots.txt (this is probably fine!)
api-1 | 2024-12-15 18:10:48 debug [crawl-redis:saveCrawl]: Saving crawl 130c2cee-6bf8-417b-a25f-7cfcb7152680 to Redis...
api-1 | 2024-12-15 18:10:48 debug [WebCrawler:tryGetSitemap]: Fetching sitemap links from https://www.britishairways.com/travel/home/public/en_us/
api-1 | 2024-12-15 18:10:53 debug [WebCrawler:tryFetchSitemapLinks]: Failed to fetch sitemap with axios from https://www.britishairways.com/travel/home/public/en_us//sitemap.xml
api-1 | 2024-12-15 18:10:53 info [ScrapeURL:]: Scraping URL "https://www.britishairways.com/travel/home/public/en_us//sitemap.xml"...
api-1 | 2024-12-15 18:10:53 debug [ScrapeURL:]: Engine fire-engine;tlsclient meets feature priority threshold
api-1 | 2024-12-15 18:10:53 info [ScrapeURL:]: Scraping via fire-engine;tlsclient...
api-1 | 2024-12-15 18:10:53 debug [ScrapeURL:fireEngineScrape/robustFetch]: Request failed, trying 2 more times
api-1 | 2024-12-15 18:10:53 debug [ScrapeURL:fireEngineScrape/robustFetch]: Request failed, trying 1 more times
api-1 | 2024-12-15 18:10:53 debug [ScrapeURL:fireEngineScrape/robustFetch]: Request failed
api-1 | 2024-12-15 18:10:53 info [ScrapeURL:]: An unexpected error happened while scraping with fire-engine;tlsclient.
api-1 | 2024-12-15 18:10:53 warn [ScrapeURL:]: scrapeURL: All scraping engines failed! {"module":"ScrapeURL","scrapeId":"sitemap","scrapeURL":"https://www.britishairways.com/travel/home/public/en_us//sitemap.xml","error":{"fallbackList":["fire-engine;tlsclient"],"results":{"fire-engine;tlsclient":{"state":"error","error":{"name":"Error","message":"Request failed","stack":"Error: Request failed\n at robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:81:23)\n at async robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:73:24)\n at async robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:73:24)\n at async /app/dist/src/scraper/scrapeURL/engines/fire-engine/scrape.js:43:16\n at async fireEngineScrape (/app/dist/src/scraper/scrapeURL/engines/fire-engine/scrape.js:37:27)\n at async performFireEngineScrape (/app/dist/src/scraper/scrapeURL/engines/fire-engine/index.js:37:20)\n at async scrapeURLWithFireEngineTLSClient (/app/dist/src/scraper/scrapeURL/engines/fire-engine/index.js:214:20)\n at async scrapeURLWithEngine (/app/dist/src/scraper/scrapeURL/engines/index.js:294:12)\n at async scrapeURLLoop (/app/dist/src/scraper/scrapeURL/index.js:121:35)\n at async scrapeURL (/app/dist/src/scraper/scrapeURL/index.js:249:24)\n at async getLinksFromSitemap (/app/dist/src/scraper/WebScraper/sitemap.js:22:34)\n at async WebCrawler.tryFetchSitemapLinks (/app/dist/src/scraper/WebScraper/crawler.js:390:34)\n at async WebCrawler.tryGetSitemap (/app/dist/src/scraper/WebScraper/crawler.js:173:30)\n at async crawlController 
(/app/dist/src/controllers/v1/crawl.js:91:11)","cause":{"params":{"url":"undefined/scrape","logger":{},"method":"POST","body":{"url":"https://www.britishairways.com/travel/home/public/en_us//sitemap.xml","engine":"tlsclient","instantReturn":true,"disableJsDom":true,"timeout":30000},"headers":{},"schema":{"_def":{"unknownKeys":"strip","catchall":{"_def":{"typeName":"ZodNever"}},"typeName":"ZodObject"},"_cached":null},"ignoreResponse":false,"ignoreFailure":false,"tryCount":1},"requestId":"43b53df2-a334-4c6e-8f7a-fcf31d8af173","error":{"name":"TypeError","message":"Failed to parse URL from undefined/scrape","stack":"TypeError: Failed to parse URL from undefined/scrape\n at node:internal/deps/undici/undici:13392:13\n at async robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:45:19)\n at async robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:73:24)\n at async robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:73:24)\n at async /app/dist/src/scraper/scrapeURL/engines/fire-engine/scrape.js:43:16\n at async fireEngineScrape (/app/dist/src/scraper/scrapeURL/engines/fire-engine/scrape.js:37:27)\n at async performFireEngineScrape (/app/dist/src/scraper/scrapeURL/engines/fire-engine/index.js:37:20)\n at async scrapeURLWithFireEngineTLSClient (/app/dist/src/scraper/scrapeURL/engines/fire-engine/index.js:214:20)\n at async scrapeURLWithEngine (/app/dist/src/scraper/scrapeURL/engines/index.js:294:12)\n at async scrapeURLLoop (/app/dist/src/scraper/scrapeURL/index.js:121:35)\n at async scrapeURL (/app/dist/src/scraper/scrapeURL/index.js:249:24)\n at async getLinksFromSitemap (/app/dist/src/scraper/WebScraper/sitemap.js:22:34)\n at async WebCrawler.tryFetchSitemapLinks (/app/dist/src/scraper/WebScraper/crawler.js:390:34)\n at async WebCrawler.tryGetSitemap (/app/dist/src/scraper/WebScraper/crawler.js:173:30)\n at async crawlController 
(/app/dist/src/controllers/v1/crawl.js:91:11)","cause":{"code":"ERR_INVALID_URL","input":"undefined/scrape","name":"TypeError","message":"Invalid URL","stack":"TypeError: Invalid URL\n at new URL (node:internal/url:806:29)\n at new Request (node:internal/deps/undici/undici:9474:25)\n at fetch (node:internal/deps/undici/undici:10203:25)\n at fetch (node:internal/deps/undici/undici:13390:10)\n at fetch (node:internal/bootstrap/web/exposed-window-or-worker:72:12)\n at robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:45:25)\n at robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:73:30)\n at async robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:73:24)\n at async /app/dist/src/scraper/scrapeURL/engines/fire-engine/scrape.js:43:16\n at async fireEngineScrape (/app/dist/src/scraper/scrapeURL/engines/fire-engine/scrape.js:37:27)\n at async performFireEngineScrape (/app/dist/src/scraper/scrapeURL/engines/fire-engine/index.js:37:20)\n at async scrapeURLWithFireEngineTLSClient (/app/dist/src/scraper/scrapeURL/engines/fire-engine/index.js:214:20)\n at async scrapeURLWithEngine (/app/dist/src/scraper/scrapeURL/engines/index.js:294:12)\n at async scrapeURLLoop (/app/dist/src/scraper/scrapeURL/index.js:121:35)\n at async scrapeURL (/app/dist/src/scraper/scrapeURL/index.js:249:24)\n at async getLinksFromSitemap (/app/dist/src/scraper/WebScraper/sitemap.js:22:34)"}}}},"unexpected":true,"startedAt":1734286253679,"finishedAt":1734286253690}},"name":"Error","message":"All scraping engines failed! -- Double check the URL to make sure it's not broken. If the issue persists, contact us at [email protected].","stack":"Error: All scraping engines failed! -- Double check the URL to make sure it's not broken. 
If the issue persists, contact us at [email protected].\n at scrapeURLLoop (/app/dist/src/scraper/scrapeURL/index.js:211:15)\n at async scrapeURL (/app/dist/src/scraper/scrapeURL/index.js:249:24)\n at async getLinksFromSitemap (/app/dist/src/scraper/WebScraper/sitemap.js:22:34)\n at async WebCrawler.tryFetchSitemapLinks (/app/dist/src/scraper/WebScraper/crawler.js:390:34)\n at async WebCrawler.tryGetSitemap (/app/dist/src/scraper/WebScraper/crawler.js:173:30)\n at async crawlController (/app/dist/src/controllers/v1/crawl.js:91:11)"}}
api-1 | 2024-12-15 18:10:53 error [WebCrawler:getLinksFromSitemap]: Request failed for https://www.britishairways.com/travel/home/public/en_us//sitemap.xml {"crawlId":"130c2cee-6bf8-417b-a25f-7cfcb7152680","module":"WebCrawler","method":"getLinksFromSitemap","mode":"fire-engine","sitemapUrl":"https://www.britishairways.com/travel/home/public/en_us//sitemap.xml","error":{"fallbackList":["fire-engine;tlsclient"],"results":{"fire-engine;tlsclient":{"state":"error","error":{"name":"Error","message":"Request failed","stack":"Error: Request failed\n at robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:81:23)\n at async robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:73:24)\n at async robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:73:24)\n at async /app/dist/src/scraper/scrapeURL/engines/fire-engine/scrape.js:43:16\n at async fireEngineScrape (/app/dist/src/scraper/scrapeURL/engines/fire-engine/scrape.js:37:27)\n at async performFireEngineScrape (/app/dist/src/scraper/scrapeURL/engines/fire-engine/index.js:37:20)\n at async scrapeURLWithFireEngineTLSClient (/app/dist/src/scraper/scrapeURL/engines/fire-engine/index.js:214:20)\n at async scrapeURLWithEngine (/app/dist/src/scraper/scrapeURL/engines/index.js:294:12)\n at async scrapeURLLoop (/app/dist/src/scraper/scrapeURL/index.js:121:35)\n at async scrapeURL (/app/dist/src/scraper/scrapeURL/index.js:249:24)\n at async getLinksFromSitemap (/app/dist/src/scraper/WebScraper/sitemap.js:22:34)\n at async WebCrawler.tryFetchSitemapLinks (/app/dist/src/scraper/WebScraper/crawler.js:390:34)\n at async WebCrawler.tryGetSitemap (/app/dist/src/scraper/WebScraper/crawler.js:173:30)\n at async crawlController 
(/app/dist/src/controllers/v1/crawl.js:91:11)","cause":{"params":{"url":"undefined/scrape","logger":{},"method":"POST","body":{"url":"https://www.britishairways.com/travel/home/public/en_us//sitemap.xml","engine":"tlsclient","instantReturn":true,"disableJsDom":true,"timeout":30000},"headers":{},"schema":{"_def":{"unknownKeys":"strip","catchall":{"_def":{"typeName":"ZodNever"}},"typeName":"ZodObject"},"_cached":null},"ignoreResponse":false,"ignoreFailure":false,"tryCount":1},"requestId":"43b53df2-a334-4c6e-8f7a-fcf31d8af173","error":{"name":"TypeError","message":"Failed to parse URL from undefined/scrape","stack":"TypeError: Failed to parse URL from undefined/scrape\n at node:internal/deps/undici/undici:13392:13\n at async robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:45:19)\n at async robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:73:24)\n at async robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:73:24)\n at async /app/dist/src/scraper/scrapeURL/engines/fire-engine/scrape.js:43:16\n at async fireEngineScrape (/app/dist/src/scraper/scrapeURL/engines/fire-engine/scrape.js:37:27)\n at async performFireEngineScrape (/app/dist/src/scraper/scrapeURL/engines/fire-engine/index.js:37:20)\n at async scrapeURLWithFireEngineTLSClient (/app/dist/src/scraper/scrapeURL/engines/fire-engine/index.js:214:20)\n at async scrapeURLWithEngine (/app/dist/src/scraper/scrapeURL/engines/index.js:294:12)\n at async scrapeURLLoop (/app/dist/src/scraper/scrapeURL/index.js:121:35)\n at async scrapeURL (/app/dist/src/scraper/scrapeURL/index.js:249:24)\n at async getLinksFromSitemap (/app/dist/src/scraper/WebScraper/sitemap.js:22:34)\n at async WebCrawler.tryFetchSitemapLinks (/app/dist/src/scraper/WebScraper/crawler.js:390:34)\n at async WebCrawler.tryGetSitemap (/app/dist/src/scraper/WebScraper/crawler.js:173:30)\n at async crawlController 
(/app/dist/src/controllers/v1/crawl.js:91:11)","cause":{"code":"ERR_INVALID_URL","input":"undefined/scrape","name":"TypeError","message":"Invalid URL","stack":"TypeError: Invalid URL\n at new URL (node:internal/url:806:29)\n at new Request (node:internal/deps/undici/undici:9474:25)\n at fetch (node:internal/deps/undici/undici:10203:25)\n at fetch (node:internal/deps/undici/undici:13390:10)\n at fetch (node:internal/bootstrap/web/exposed-window-or-worker:72:12)\n at robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:45:25)\n at robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:73:30)\n at async robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:73:24)\n at async /app/dist/src/scraper/scrapeURL/engines/fire-engine/scrape.js:43:16\n at async fireEngineScrape (/app/dist/src/scraper/scrapeURL/engines/fire-engine/scrape.js:37:27)\n at async performFireEngineScrape (/app/dist/src/scraper/scrapeURL/engines/fire-engine/index.js:37:20)\n at async scrapeURLWithFireEngineTLSClient (/app/dist/src/scraper/scrapeURL/engines/fire-engine/index.js:214:20)\n at async scrapeURLWithEngine (/app/dist/src/scraper/scrapeURL/engines/index.js:294:12)\n at async scrapeURLLoop (/app/dist/src/scraper/scrapeURL/index.js:121:35)\n at async scrapeURL (/app/dist/src/scraper/scrapeURL/index.js:249:24)\n at async getLinksFromSitemap (/app/dist/src/scraper/WebScraper/sitemap.js:22:34)"}}}},"unexpected":true,"startedAt":1734286253679,"finishedAt":1734286253690}},"name":"Error","message":"All scraping engines failed! -- Double check the URL to make sure it's not broken. If the issue persists, contact us at [email protected].","stack":"Error: All scraping engines failed! -- Double check the URL to make sure it's not broken. 
If the issue persists, contact us at [email protected].\n at scrapeURLLoop (/app/dist/src/scraper/scrapeURL/index.js:211:15)\n at async scrapeURL (/app/dist/src/scraper/scrapeURL/index.js:249:24)\n at async getLinksFromSitemap (/app/dist/src/scraper/WebScraper/sitemap.js:22:34)\n at async WebCrawler.tryFetchSitemapLinks (/app/dist/src/scraper/WebScraper/crawler.js:390:34)\n at async WebCrawler.tryGetSitemap (/app/dist/src/scraper/WebScraper/crawler.js:173:30)\n at async crawlController (/app/dist/src/controllers/v1/crawl.js:91:11)"}}
api-1 | 2024-12-15 18:10:58 debug [WebCrawler:tryFetchSitemapLinks]: Failed to fetch sitemap from https://www.britishairways.com/sitemap.xml
api-1 | 2024-12-15 18:10:58 info [ScrapeURL:]: Scraping URL "https://www.britishairways.com/sitemap.xml"...
api-1 | 2024-12-15 18:10:58 debug [ScrapeURL:]: Engine fire-engine;tlsclient meets feature priority threshold
api-1 | 2024-12-15 18:10:58 info [ScrapeURL:]: Scraping via fire-engine;tlsclient...
api-1 | 2024-12-15 18:10:58 debug [ScrapeURL:fireEngineScrape/robustFetch]: Request failed, trying 2 more times
api-1 | 2024-12-15 18:10:58 debug [ScrapeURL:fireEngineScrape/robustFetch]: Request failed, trying 1 more times
api-1 | 2024-12-15 18:10:58 debug [ScrapeURL:fireEngineScrape/robustFetch]: Request failed
api-1 | 2024-12-15 18:10:58 info [ScrapeURL:]: An unexpected error happened while scraping with fire-engine;tlsclient.
api-1 | 2024-12-15 18:10:58 warn [ScrapeURL:]: scrapeURL: All scraping engines failed! {"module":"ScrapeURL","scrapeId":"sitemap","scrapeURL":"https://www.britishairways.com/sitemap.xml","error":{"fallbackList":["fire-engine;tlsclient"],"results":{"fire-engine;tlsclient":{"state":"error","error":{"name":"Error","message":"Request failed","stack":"Error: Request failed\n at robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:81:23)\n at async robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:73:24)\n at async robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:73:24)\n at async /app/dist/src/scraper/scrapeURL/engines/fire-engine/scrape.js:43:16\n at async fireEngineScrape (/app/dist/src/scraper/scrapeURL/engines/fire-engine/scrape.js:37:27)\n at async performFireEngineScrape (/app/dist/src/scraper/scrapeURL/engines/fire-engine/index.js:37:20)\n at async scrapeURLWithFireEngineTLSClient (/app/dist/src/scraper/scrapeURL/engines/fire-engine/index.js:214:20)\n at async scrapeURLWithEngine (/app/dist/src/scraper/scrapeURL/engines/index.js:294:12)\n at async scrapeURLLoop (/app/dist/src/scraper/scrapeURL/index.js:121:35)\n at async scrapeURL (/app/dist/src/scraper/scrapeURL/index.js:249:24)\n at async getLinksFromSitemap (/app/dist/src/scraper/WebScraper/sitemap.js:22:34)\n at async WebCrawler.tryFetchSitemapLinks (/app/dist/src/scraper/WebScraper/crawler.js:416:36)\n at async WebCrawler.tryGetSitemap (/app/dist/src/scraper/WebScraper/crawler.js:173:30)\n at async crawlController 
(/app/dist/src/controllers/v1/crawl.js:91:11)","cause":{"params":{"url":"undefined/scrape","logger":{},"method":"POST","body":{"url":"https://www.britishairways.com/sitemap.xml","engine":"tlsclient","instantReturn":true,"disableJsDom":true,"timeout":30000},"headers":{},"schema":{"_def":{"unknownKeys":"strip","catchall":{"_def":{"typeName":"ZodNever"}},"typeName":"ZodObject"},"_cached":null},"ignoreResponse":false,"ignoreFailure":false,"tryCount":1},"requestId":"d3539c0b-ce44-4843-b532-5894a8d9ffb1","error":{"name":"TypeError","message":"Failed to parse URL from undefined/scrape","stack":"TypeError: Failed to parse URL from undefined/scrape\n at node:internal/deps/undici/undici:13392:13\n at async robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:45:19)\n at async robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:73:24)\n at async robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:73:24)\n at async /app/dist/src/scraper/scrapeURL/engines/fire-engine/scrape.js:43:16\n at async fireEngineScrape (/app/dist/src/scraper/scrapeURL/engines/fire-engine/scrape.js:37:27)\n at async performFireEngineScrape (/app/dist/src/scraper/scrapeURL/engines/fire-engine/index.js:37:20)\n at async scrapeURLWithFireEngineTLSClient (/app/dist/src/scraper/scrapeURL/engines/fire-engine/index.js:214:20)\n at async scrapeURLWithEngine (/app/dist/src/scraper/scrapeURL/engines/index.js:294:12)\n at async scrapeURLLoop (/app/dist/src/scraper/scrapeURL/index.js:121:35)\n at async scrapeURL (/app/dist/src/scraper/scrapeURL/index.js:249:24)\n at async getLinksFromSitemap (/app/dist/src/scraper/WebScraper/sitemap.js:22:34)\n at async WebCrawler.tryFetchSitemapLinks (/app/dist/src/scraper/WebScraper/crawler.js:416:36)\n at async WebCrawler.tryGetSitemap (/app/dist/src/scraper/WebScraper/crawler.js:173:30)\n at async crawlController (/app/dist/src/controllers/v1/crawl.js:91:11)","cause":{"code":"ERR_INVALID_URL","input":"undefined/scrape","name":"TypeError","message":"Invalid 
URL","stack":"TypeError: Invalid URL\n at new URL (node:internal/url:806:29)\n at new Request (node:internal/deps/undici/undici:9474:25)\n at fetch (node:internal/deps/undici/undici:10203:25)\n at fetch (node:internal/deps/undici/undici:13390:10)\n at fetch (node:internal/bootstrap/web/exposed-window-or-worker:72:12)\n at robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:45:25)\n at robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:73:30)\n at async robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:73:24)\n at async /app/dist/src/scraper/scrapeURL/engines/fire-engine/scrape.js:43:16\n at async fireEngineScrape (/app/dist/src/scraper/scrapeURL/engines/fire-engine/scrape.js:37:27)\n at async performFireEngineScrape (/app/dist/src/scraper/scrapeURL/engines/fire-engine/index.js:37:20)\n at async scrapeURLWithFireEngineTLSClient (/app/dist/src/scraper/scrapeURL/engines/fire-engine/index.js:214:20)\n at async scrapeURLWithEngine (/app/dist/src/scraper/scrapeURL/engines/index.js:294:12)\n at async scrapeURLLoop (/app/dist/src/scraper/scrapeURL/index.js:121:35)\n at async scrapeURL (/app/dist/src/scraper/scrapeURL/index.js:249:24)\n at async getLinksFromSitemap (/app/dist/src/scraper/WebScraper/sitemap.js:22:34)"}}}},"unexpected":true,"startedAt":1734286258707,"finishedAt":1734286258717}},"name":"Error","message":"All scraping engines failed! -- Double check the URL to make sure it's not broken. If the issue persists, contact us at [email protected].","stack":"Error: All scraping engines failed! -- Double check the URL to make sure it's not broken. 
If the issue persists, contact us at [email protected].\n at scrapeURLLoop (/app/dist/src/scraper/scrapeURL/index.js:211:15)\n at async scrapeURL (/app/dist/src/scraper/scrapeURL/index.js:249:24)\n at async getLinksFromSitemap (/app/dist/src/scraper/WebScraper/sitemap.js:22:34)\n at async WebCrawler.tryFetchSitemapLinks (/app/dist/src/scraper/WebScraper/crawler.js:416:36)\n at async WebCrawler.tryGetSitemap (/app/dist/src/scraper/WebScraper/crawler.js:173:30)\n at async crawlController (/app/dist/src/controllers/v1/crawl.js:91:11)"}}
api-1 | 2024-12-15 18:10:58 error [WebCrawler:getLinksFromSitemap]: Request failed for https://www.britishairways.com/sitemap.xml {"crawlId":"130c2cee-6bf8-417b-a25f-7cfcb7152680","module":"WebCrawler","method":"getLinksFromSitemap","mode":"fire-engine","sitemapUrl":"https://www.britishairways.com/sitemap.xml","error":{"fallbackList":["fire-engine;tlsclient"],"results":{"fire-engine;tlsclient":{"state":"error","error":{"name":"Error","message":"Request failed","stack":"Error: Request failed\n at robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:81:23)\n at async robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:73:24)\n at async robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:73:24)\n at async /app/dist/src/scraper/scrapeURL/engines/fire-engine/scrape.js:43:16\n at async fireEngineScrape (/app/dist/src/scraper/scrapeURL/engines/fire-engine/scrape.js:37:27)\n at async performFireEngineScrape (/app/dist/src/scraper/scrapeURL/engines/fire-engine/index.js:37:20)\n at async scrapeURLWithFireEngineTLSClient (/app/dist/src/scraper/scrapeURL/engines/fire-engine/index.js:214:20)\n at async scrapeURLWithEngine (/app/dist/src/scraper/scrapeURL/engines/index.js:294:12)\n at async scrapeURLLoop (/app/dist/src/scraper/scrapeURL/index.js:121:35)\n at async scrapeURL (/app/dist/src/scraper/scrapeURL/index.js:249:24)\n at async getLinksFromSitemap (/app/dist/src/scraper/WebScraper/sitemap.js:22:34)\n at async WebCrawler.tryFetchSitemapLinks (/app/dist/src/scraper/WebScraper/crawler.js:416:36)\n at async WebCrawler.tryGetSitemap (/app/dist/src/scraper/WebScraper/crawler.js:173:30)\n at async crawlController 
(/app/dist/src/controllers/v1/crawl.js:91:11)","cause":{"params":{"url":"undefined/scrape","logger":{},"method":"POST","body":{"url":"https://www.britishairways.com/sitemap.xml","engine":"tlsclient","instantReturn":true,"disableJsDom":true,"timeout":30000},"headers":{},"schema":{"_def":{"unknownKeys":"strip","catchall":{"_def":{"typeName":"ZodNever"}},"typeName":"ZodObject"},"_cached":null},"ignoreResponse":false,"ignoreFailure":false,"tryCount":1},"requestId":"d3539c0b-ce44-4843-b532-5894a8d9ffb1","error":{"name":"TypeError","message":"Failed to parse URL from undefined/scrape","stack":"TypeError: Failed to parse URL from undefined/scrape\n at node:internal/deps/undici/undici:13392:13\n at async robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:45:19)\n at async robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:73:24)\n at async robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:73:24)\n at async /app/dist/src/scraper/scrapeURL/engines/fire-engine/scrape.js:43:16\n at async fireEngineScrape (/app/dist/src/scraper/scrapeURL/engines/fire-engine/scrape.js:37:27)\n at async performFireEngineScrape (/app/dist/src/scraper/scrapeURL/engines/fire-engine/index.js:37:20)\n at async scrapeURLWithFireEngineTLSClient (/app/dist/src/scraper/scrapeURL/engines/fire-engine/index.js:214:20)\n at async scrapeURLWithEngine (/app/dist/src/scraper/scrapeURL/engines/index.js:294:12)\n at async scrapeURLLoop (/app/dist/src/scraper/scrapeURL/index.js:121:35)\n at async scrapeURL (/app/dist/src/scraper/scrapeURL/index.js:249:24)\n at async getLinksFromSitemap (/app/dist/src/scraper/WebScraper/sitemap.js:22:34)\n at async WebCrawler.tryFetchSitemapLinks (/app/dist/src/scraper/WebScraper/crawler.js:416:36)\n at async WebCrawler.tryGetSitemap (/app/dist/src/scraper/WebScraper/crawler.js:173:30)\n at async crawlController (/app/dist/src/controllers/v1/crawl.js:91:11)","cause":{"code":"ERR_INVALID_URL","input":"undefined/scrape","name":"TypeError","message":"Invalid 
URL","stack":"TypeError: Invalid URL\n at new URL (node:internal/url:806:29)\n at new Request (node:internal/deps/undici/undici:9474:25)\n at fetch (node:internal/deps/undici/undici:10203:25)\n at fetch (node:internal/deps/undici/undici:13390:10)\n at fetch (node:internal/bootstrap/web/exposed-window-or-worker:72:12)\n at robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:45:25)\n at robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:73:30)\n at async robustFetch (/app/dist/src/scraper/scrapeURL/lib/fetch.js:73:24)\n at async /app/dist/src/scraper/scrapeURL/engines/fire-engine/scrape.js:43:16\n at async fireEngineScrape (/app/dist/src/scraper/scrapeURL/engines/fire-engine/scrape.js:37:27)\n at async performFireEngineScrape (/app/dist/src/scraper/scrapeURL/engines/fire-engine/index.js:37:20)\n at async scrapeURLWithFireEngineTLSClient (/app/dist/src/scraper/scrapeURL/engines/fire-engine/index.js:214:20)\n at async scrapeURLWithEngine (/app/dist/src/scraper/scrapeURL/engines/index.js:294:12)\n at async scrapeURLLoop (/app/dist/src/scraper/scrapeURL/index.js:121:35)\n at async scrapeURL (/app/dist/src/scraper/scrapeURL/index.js:249:24)\n at async getLinksFromSitemap (/app/dist/src/scraper/WebScraper/sitemap.js:22:34)"}}}},"unexpected":true,"startedAt":1734286258707,"finishedAt":1734286258717}},"name":"Error","message":"All scraping engines failed! -- Double check the URL to make sure it's not broken. If the issue persists, contact us at [email protected].","stack":"Error: All scraping engines failed! -- Double check the URL to make sure it's not broken. 
If the issue persists, contact us at [email protected].\n at scrapeURLLoop (/app/dist/src/scraper/scrapeURL/index.js:211:15)\n at async scrapeURL (/app/dist/src/scraper/scrapeURL/index.js:249:24)\n at async getLinksFromSitemap (/app/dist/src/scraper/WebScraper/sitemap.js:22:34)\n at async WebCrawler.tryFetchSitemapLinks (/app/dist/src/scraper/WebScraper/crawler.js:416:36)\n at async WebCrawler.tryGetSitemap (/app/dist/src/scraper/WebScraper/crawler.js:173:30)\n at async crawlController (/app/dist/src/controllers/v1/crawl.js:91:11)"}}

Tried a crawl on the same URL.
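The stack trace above bottoms out in TypeError: Failed to parse URL from undefined/scrape, i.e. the fire-engine base URL was undefined at the moment the request URL was built, which suggests a missing configuration value rather than a network failure. A minimal sketch of that failure mode (the variable below is illustrative, not Firecrawl's actual config name):

```typescript
// Reproducing the failure mode from the stack trace: when a base-URL
// setting is missing, template concatenation yields the literal string
// "undefined/scrape", which the URL parser (and therefore fetch) rejects.
const baseUrl: string | undefined = undefined; // stands in for an unset env var
const requestUrl = `${baseUrl}/scrape`;

let parseError: Error | null = null;
try {
  new URL(requestUrl); // the same validation fetch() performs internally
} catch (err) {
  parseError = err as Error;
}

console.log(requestUrl); // "undefined/scrape"
console.log(parseError === null ? "parsed ok" : "URL parse failed");
```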

@rostwal95
Author

@mogery Is there any update on this one?

worker-1 | 2025-01-02 11:26:56 info [ScrapeURL:]: Scraping via scrapingbee...
worker-1 | 2025-01-02 11:26:57 error [ScrapeURL:]: ScrapingBee threw an error {"module":"ScrapeURL","scrapeId":"54e64e5e-d859-4b06-890f-2fac04054b1e","scrapeURL":"https://www.britishairways.com/travel/home/public/en_us/","method":"","engine":"scrapingbee","body":{"message":"Invalid api key: # use if you'd like to use as a fallback scraper"}}
worker-1 | 2025-01-02 11:26:57 info [ScrapeURL:]: Engine scrapingbee could not scrape the page.
worker-1 | 2025-01-02 11:26:57 info [ScrapeURL:]: Scraping via scrapingbeeLoad...
worker-1 | 2025-01-02 11:26:57 error [ScrapeURL:]: ScrapingBee threw an error {"module":"ScrapeURL","scrapeId":"54e64e5e-d859-4b06-890f-2fac04054b1e","scrapeURL":"https://www.britishairways.com/travel/home/public/en_us/","method":"","engine":"scrapingbeeLoad","body":{"message":"Invalid api key: # use if you'd like to use as a fallback scraper"}}
worker-1 | 2025-01-02 11:26:57 info [ScrapeURL:]: Engine scrapingbeeLoad could not scrape the page.
worker-1 | 2025-01-02 11:26:57 info [ScrapeURL:]: Scraping via playwright...
worker-1 | 2025-01-02 11:26:57 debug [ScrapeURL:scrapeURLWithPlaywright]: Request failed
worker-1 | 2025-01-02 11:26:57 info [ScrapeURL:]: An unexpected error happened while scraping with playwright.
worker-1 | 2025-01-02 11:26:57 info [ScrapeURL:]: Scraping via fetch...
worker-1 | 2025-01-02 11:26:58 info [ScrapeURL:]: Scrape via fetch deemed successful.

@Wanli063

Wanli063 commented Jan 7, 2025

Describe the Issue
Call to playwright fails when trying to scrape with playwright.

To Reproduce
Steps to reproduce the issue:

  1. Configure the environment or settings with '...'
  2. Run the command '...'
  3. Observe the error or unexpected output at '...'
  4. Log output/error message

Expected Behavior
The call to playwright should be successful and dynamic js should be rendered and cleaned up.

Screenshots
If applicable, add screenshots or copies of the command line output to help explain the self-hosting issue.

Environment (please complete the following information):

  • OS: [e.g. macOS, Linux, Windows]
  • Firecrawl Version: [e.g. 1.2.3]
  • Node.js Version: [e.g. 14.x]
  • Docker Version (if applicable): [e.g. 20.10.14]
  • Database Type and Version: [e.g. PostgreSQL 13.4]

Logs
worker-1 | 2024-11-15 05:13:48 debug [ScrapeURL:]: Engine docx meets feature priority threshold
worker-1 | 2024-11-15 05:13:48 info [ScrapeURL:]: Scraping via playwright...
worker-1 | 2024-11-15 05:13:48 debug [ScrapeURL:scrapeURLWithPlaywright]: Sending request...
worker-1 | 2024-11-15 05:13:48 debug [ScrapeURL:scrapeURLWithPlaywright]: Request sent failure status
worker-1 | 2024-11-15 05:13:48 info [ScrapeURL:]: An unexpected error happened while scraping with playwright.
worker-1 | 2024-11-15 05:13:48 info [ScrapeURL:]: Scraping via fetch...

here are the logs

Configuration
Provide relevant parts of your configuration files (with sensitive information redacted).

Additional Context
Add any other context about the self-hosting issue here, such as specific infrastructure details, network setup, or any modifications made to the original Firecrawl setup.

@Wanli063

Wanli063 commented Jan 7, 2025

Same issue here, +1. Is there any solution?

@wesselhuising

I think there is a problem with the Zod validation of the response. When setting the log level to DEBUG, the following error pops up:

worker-1 | 2025-01-11 11:13:21 debug [ScrapeURL:scrapeURLWithPlaywright]: Response does not match provided schema

The /html POST endpoint is working, and the response looks "good" to me; it appears to match the Zod response schema. I might be missing something here, as I am new to this project, but my feeling is that the problem lies in the Zod schema validation of the response from the POST call to the /html route.
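The schema in question (posted in a follow-up comment) requires content to be a string and pageStatusCode to be a number. A dependency-free sketch of the same strictness shows how a response can "look good" in the logs yet still fail validation; one plausible mismatch, used for illustration here, is a status code serialized as a string:

```typescript
// A strict validator mimicking the engine's Zod schema: content must be a
// string and pageStatusCode must be a number. A numeric string such as
// "200" is rejected, even though it looks fine when the body is printed.
function validate(body: unknown): { ok: boolean; reason?: string } {
  const o = body as Record<string, unknown> | null;
  if (o === null || typeof o !== "object") return { ok: false, reason: "body is not an object" };
  if (typeof o.content !== "string") return { ok: false, reason: "content is not a string" };
  if (typeof o.pageStatusCode !== "number") return { ok: false, reason: "pageStatusCode is not a number" };
  if (o.pageError !== undefined && typeof o.pageError !== "string") {
    return { ok: false, reason: "pageError is not a string" };
  }
  return { ok: true };
}

const stringCode = validate({ content: "<html></html>", pageStatusCode: "200" });
const numberCode = validate({ content: "<html></html>", pageStatusCode: 200 });

console.log(stringCode); // rejected: pageStatusCode is not a number
console.log(numberCode); // accepted
```

Logging the actual parse error (e.g. via Zod's safeParse) would confirm which field is the culprit in a given setup.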

@wesselhuising

Update: removing the Zod response validation in the engine file (scraper/scrapeURL/engines/playwright/index.ts) fixes the issue.

This is the failing code:

      schema: z.object({
        content: z.string(),
        pageStatusCode: z.number(),
        pageError: z.string().optional()
      }),
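Removing the validation works as a workaround, but it also drops the safety the schema provided. If the mismatch is a type-level one, for example a numeric field arriving as a string, a middle ground is to normalize the body before the strict check. This is a sketch under that assumption, not Firecrawl's actual code (Zod itself offers z.coerce.number() for the same purpose):

```typescript
// Coerce pageStatusCode to a number before strict validation, so a
// service that serializes it as "200" still produces a valid object.
function normalize(body: Record<string, unknown>): Record<string, unknown> {
  const code = body.pageStatusCode;
  if (typeof code === "string" && code.trim() !== "" && !Number.isNaN(Number(code))) {
    return { ...body, pageStatusCode: Number(code) };
  }
  return body; // leave anything else untouched for the schema to judge
}

const normalized = normalize({ content: "<html></html>", pageStatusCode: "200" });
console.log(typeof normalized.pageStatusCode); // "number"
```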
