24 Jan 22:50

nickscamara

fa5544a

Extract Improvements - v1.4.1

We've significantly enhanced our data extraction capabilities with several key updates:

Extract now returns a lot more data due to a new re-ranker system
Improved infrastructure reliability
Migrated from Cheerio to a high-performance Rust-based parser for faster and more memory-efficient parsing
Enhanced crawl cancellation functionality for better control over running jobs

What's Changed

Added "today" to extract prompts by @rafaelsideguide in #1084
docs: update cancel crawl response by @ftonato in #1087
port most of cheerio stuff to rust by @mogery in #1089
Re-ranker changes by @nickscamara in #1090
Rerank with lower threshold + back to map if length = 0 by @rafaelsideguide in #1086

Full Changelog: v1.4.0...1.4.1

Contributors

ftonato, nickscamara, and 2 other contributors

20 Jan 14:17

nickscamara

v1.4.0

2d4f4de

Introducing /extract - v1.4.0

Get structured web data with /extract

We’re excited to announce the release of /extract - get data from any website with just a prompt. With /extract, you can retrieve any information from anywhere on a website without being limited by scraping roadblocks or the typical context constraints of LLMs.

No more manual copy-pasting, broken scraping scripts, or debugging LLM calls: it’s never been easier to enrich your data, create datasets, or power AI applications with clean, structured data from any website.

Companies are already using extract to:

Enrich CRM data
Streamline KYB processes
Monitor competitors
Supercharge onboarding experiences
Build targeted prospecting lists

Instead of spending hours manually researching, fixing broken scrapers, or piecing together data from multiple sources, simply specify what information you need and the target website, and let Firecrawl handle the entire retrieval process.

Specifically, you can:

Extract structured data from entire websites using URL wildcards (https://example.com/*)
Define custom schemas to capture exactly what you need—from simple product details to complex organizational structures
Guide the extraction with custom prompts to ensure the LLM focuses on your target information
Deploy anywhere with comprehensive support for Python, Node, cURL, and other popular tools. For no-code workflows, just connect via Zapier or use our API to set up integrations with other tools.

This versatility translates into a wide range of real-world applications—enabling you to enrich web data for just about any use case.
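
As a rough illustration of how a wildcard URL, a custom schema, and a prompt fit together, here is a minimal Python sketch that only assembles a request body and sends nothing. The field names (urls, prompt, schema), the helper build_extract_request, and the schema fields are assumptions for illustration, not a confirmed API contract; check the Extract documentation for the exact parameters.

```python
import json
from typing import Any, Dict, List


def build_extract_request(urls: List[str], prompt: str, schema: Dict[str, Any]) -> Dict[str, Any]:
    """Assemble a JSON-serializable body for an /extract call.

    The field names ("urls", "prompt", "schema") are assumptions made for
    this sketch; the real contract is defined by the Extract documentation.
    """
    if not urls:
        raise ValueError("at least one URL (or wildcard pattern) is required")
    return {"urls": list(urls), "prompt": prompt, "schema": schema}


# A JSON Schema describing the shape we want back (hypothetical fields).
company_schema = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "pricing_tiers": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["company_name"],
}

body = build_extract_request(
    urls=["https://example.com/*"],  # wildcard: consider the whole site
    prompt="Find the company name and the names of its pricing tiers.",
    schema=company_schema,
)

print(json.dumps(body, indent=2))
```

Keeping the body construction separate from the HTTP call makes it easy to validate or log the request before spending any tokens.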

Limitations (and the road ahead)

Let's be honest - while /extract is pretty awesome at grabbing web data, it's not perfect yet. Here's what we're still working on:
Big sites are tricky - It can't (yet!) grab every single product on Amazon in one go
Complex searches need work - Things like "find all posts from 2025" aren't quite there
Sometimes, it's a bit quirky - Results can vary between runs, though it usually gets what you need
But here's the exciting part: we're seeing the future of web scraping take shape.

Try it out

Curious to try /extract out for yourself?
Visit our playground to try out /extract - you get 500,000 tokens for free
Dive into our Extract Beta documentation for detailed technical guidance and API reference
Want a no-code solution? Connect /extract to thousands of applications through our enhanced Zapier integration

That's all for now! Happy Extracting from the whole Firecrawl team 🔥

Full Changelog: v.1.3.0...v1.4.0

14 Jan 22:40

nickscamara

v.1.3.0

957eea4

v1.3 - /extract improvements

What's Changed

feat: new snips test framework (FIR-414) by @mogery in #1033
(feat/extract) New re-ranker + multi entity extraction by @nickscamara in #1061
__experimental_streamSteps by @nickscamara in #1063

Full Changelog: v1.2.1...v.1.3.0

Contributors

nickscamara and mogery

10 Jan 17:54

nickscamara

v1.2.1

d1f3b96

v1.2.1 - /extract Beta Improvements

What's Changed

Indexes, Caching for /extract, Improvements by @nickscamara in #1037
[SDK] fixed none and undefined on response by @rafaelsideguide in #1034
feat: use new random user agent instead of the old one by @1101-1 in #1038
(feat/extract) Move extract to a queue system by @nickscamara in #1044

/extract (beta) changes

We have updated the /extract endpoint to now be asynchronous. When you make a request to /extract, it will return an ID that you can use to check the status of your extract job. If you are using our SDKs, there are no changes required to your code, but please make sure to update the SDKs to the latest versions as soon as possible.
For those using the API directly, we have made it backwards compatible. However, you have 10 days to update your implementation to the new asynchronous model.
For more details about the parameters, refer to the docs sent to you.
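
For those calling the API directly, the asynchronous flow described above amounts to "submit, get an ID, poll for status". The Python sketch below shows that loop under stated assumptions: the response field names (id, status), the terminal status values, and the helper run_extract_job are all hypothetical, and the HTTP calls are injected as functions so the example runs offline.

```python
import time
from typing import Any, Callable, Dict

# Hypothetical transport signatures: the caller supplies functions that
# perform the real HTTP calls (for example with requests or urllib).
SubmitFn = Callable[[Dict[str, Any]], Dict[str, Any]]
StatusFn = Callable[[str], Dict[str, Any]]


def run_extract_job(
    submit: SubmitFn,
    get_status: StatusFn,
    body: Dict[str, Any],
    poll_interval: float = 2.0,
    timeout: float = 300.0,
) -> Dict[str, Any]:
    """Submit an extract job, then poll until it finishes or times out.

    Assumes the submit response carries the job ID under "id" and the status
    response carries a "status" field; both names are illustrative.
    """
    job_id = submit(body)["id"]
    deadline = time.monotonic() + timeout
    while True:
        result = get_status(job_id)
        # Terminal status names are assumptions made for this sketch.
        if result.get("status") in ("completed", "failed", "cancelled"):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"extract job {job_id} still running after {timeout}s")
        time.sleep(poll_interval)


# Offline demonstration with fake transports standing in for the HTTP calls.
_responses = iter([
    {"status": "processing"},
    {"status": "completed", "data": {"company_name": "Example Inc."}},
])

final = run_extract_job(
    submit=lambda body: {"success": True, "id": "job-123"},
    get_status=lambda job_id: next(_responses),
    body={"urls": ["https://example.com/*"], "prompt": "Find the company name."},
    poll_interval=0.0,
)
print(final["status"])  # completed
```

In real use, submit and get_status would wrap authenticated HTTP requests; the SDKs already handle this loop, which is why SDK users need no code changes.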

New Contributors

@1101-1 made their first contribution in #1038

Full Changelog: v1.2.0...v1.2.1

Changelog: https://www.firecrawl.dev/changelog#/extract-changes

Contributors

nickscamara, 1101-1, and rafaelsideguide

02 Jan 23:24

nickscamara

v1.2.0

a4b6dfe

v1.2.0 - /v1/search is now available!

/v1/search

The search endpoint combines web search with Firecrawl’s scraping capabilities to return full page content for any query.

Include scrapeOptions with formats: ["markdown"] to get complete markdown content for each search result; otherwise, it defaults to returning SERP results (url, title, description).

More info in the /v1/search docs.
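
To make the difference between the two modes concrete, here is a small Python sketch that builds both request bodies without sending them. scrapeOptions and formats come from the description above; the query field name and the helper build_search_request are assumptions for illustration.

```python
import json
from typing import Any, Dict, List, Optional


def build_search_request(query: str, formats: Optional[List[str]] = None) -> Dict[str, Any]:
    """Build a /v1/search body.

    Without formats, the body asks for plain SERP results. With formats, it
    adds scrapeOptions so each result also carries scraped content.
    The "query" field name is an assumption made for this sketch.
    """
    body: Dict[str, Any] = {"query": query}
    if formats:
        body["scrapeOptions"] = {"formats": list(formats)}
    return body


# SERP-only request: url, title, and description per result.
serp_only = build_search_request("firecrawl release notes")

# Request that also asks for markdown content of each result.
with_markdown = build_search_request("firecrawl release notes", formats=["markdown"])

print(json.dumps(serp_only))
print(json.dumps(with_markdown))
```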

What's Changed

/extract URL trace by @nickscamara in #1014
(feat/v1) Search by @nickscamara in #1032

Full Changelog: v1.1.1...v1.2.0

Contributors

nickscamara

30 Dec 15:30

nickscamara

v1.1.1

71a8f74

v1.1.1

What's Changed

feat(python-sdk): Make API key optional for self-hosted instances by @RutamBhagat in #990
Sitemap fixes by @mogery in #1010
fixed optional+default bug on llm schema by @rafaelsideguide in #955
[FIR-37] feat: extract and return favicon URL during scraping by @ftonato in #1018
fix: merge mock success data by @yujunhui in #1013
feat(rust-sdk): Make API key optional for self-hosted instances by @RutamBhagat in #991
feat(scrapeURL/pdf): switch to MU (FIR-356) by @mogery in #1016

New Contributors

@ftonato made their first contribution in #1018
@yujunhui made their first contribution in #1013

Full Changelog: v1.1.0...v1.1.1

Contributors

ftonato, yujunhui, and 3 other contributors

27 Dec 18:34

nickscamara

v1.1.0

c5b6495

v1.1.0

Starting today, we will post weekly releases here and on firecrawl.dev/changelog. This release is a summary of all the improvements and fixes we have pushed since the v1 release. Thank you all for the contributions!

Changelog Highlights

Feature Enhancements

New Features:
- Geolocation, mobile scraping, 4x faster parsing, and better webhooks.
- Credit packs, auto-recharges and batch scraping support.
- Iframe support and query parameter differentiation for URLs.
- Similar URL deduplication.
- Enhanced map ranking and sitemap fetching.

Performance Improvements

Faster crawl status filtering and improved map ranking algorithm.
Optimized Kubernetes setup and simplified build processes.
Improved sitemap discoverability and performance.

Bug Fixes

Resolved issues:
- Badly formatted JSON, scrolling actions, and encoding errors.
- Crawl limits, relative URLs, and missing error handlers.
Fixed self-hosted crawling inconsistencies and schema errors.

SDK Updates

Added dynamic WebSocket imports with fallback support.
Optional API keys for self-hosted instances.
Improved error handling across SDKs.

Documentation Updates

Improved API docs and examples.
Updated self-hosting URLs and added Kubernetes optimizations.
Added articles: mastering /scrape and /crawl.

Miscellaneous

Added new Firecrawl examples.
Enhanced metadata handling for webhooks and improved sitemap fetching.
Updated blocklist and streamlined error messages.

What's Changed

Add docs to api spec example by @ericciarla in #637
[Docs] upgraded the path of the self-hosted documentation URL to /v1. by @shige in #635
Removal of generic classnames/ids from onlyMainContent cleaning by @nickscamara in #638
Improved team credits check and billing notifications by @nickscamara in #640
Fixed 500 errors when JSON is badly formatted by @nickscamara in #648
Better engine for wait + other params by @nickscamara in #649
fix(py-sdk): removed asyncio package by @rafaelsideguide in #654
perf(js-sdk): move dotenv and uuid to devDependencies, fix zod import by @MonsterDeveloper in #614
build(js-sdk): simplify build process by @MonsterDeveloper in #611
fix(v0/crawl-status): don't crash on big crawls when requesting jobs from supa by @mogery in #653
Manual Rate Limiter for select team ids by @nickscamara in #664
O1 crawler example by @ericciarla in #676
[Bug] Fixed screenshot typo and added test for fullpage screenshot by @rafaelsideguide in #677
v1/map improvements + higher limits by @nickscamara in #674
Remove print statement in map by @anjor in #612
fix wrong link to self host documentation by @itasli in #623
feat: kubernetes example optimization by @yekkhan in #639
Rust SDK 1.0.0 by @mogery in #689
feat: Actions by @mogery in #682
Fix the error message when trying search in v0 by @nickscamara in #690
remove space in the examples/o1_web_crawler folder name by @h4r5h4 in #679
o1 job recommender example by @ericciarla in #707
Move auth and check credits operations into an RPC by @mogery in #704
bugfix: using onlyIncludeTags and removeTags together by @skeptrunedev in #685
Concurrency limits by @mogery in #721
Docs: Remove wait_until_done from python-sdk example by @bytrangle in #728
Improves error handler in Node SDK to return the status code by @nickscamara in #727
Fixes crawl failed and webhooks not working properly by @nickscamara in #731
[BUG] Fixed URLs with params by @rafaelsideguide in #732
Fixed the self host issues where methods don't work by @nickscamara in #733
Make sure the entrypoint script has the correct line endings by @busaud in #753
Rm cluster mode + rm fly deployments by @nickscamara in #754
Fixed Issue #734 by @Harsh0707005 in #747
bugfix: self-host crawling doesnt respect limit by @busaud in #755
[BUG] Fixed missing error handling in JS-SDK by @rafaelsideguide in #759
[SKD] Cancel Crawl by @rafaelsideguide in #760
fixed developer.notion special case by @rafaelsideguide in #762
Spelling Corrections in README by @fadkeabhi in #763
[RPC] Improvements to credit_usage rpc by @nickscamara in #767
[BUG] filters failed and unknown jobs now by @rafaelsideguide in #761
[Doc] Better explained how includePaths and excludePaths work by @rafaelsideguide in #766
Update README.md by @busaud in #757
ADDED : Contributors and Back to top by @Ruhi14 in #768
Retries for ACUC RPC + Price credits fallback by @nickscamara in #773
[BUG] added check files on crawl by @rafaelsideguide in #779
[Feat] Performance improvements crawl status filters by @rafaelsideguide in #780
Admin alerts for high usage by @nickscamara in #783
Geolocation support for Firecrawl by @nickscamara in #784
Return all the website metadata by @nickscamara in #785
Extractor options logging v1 fix by @nickscamara in #788
Update requirements.txt by @rishi-raj-jain in #790
Improved /map ranking algorithm for search queries by @nickscamara in #798
Fix Typos and Grammar in SELF_HOST.md by @Mefisto04 in #799
[Bug] encoding error for special token by @rafaelsideguide in #793
[BUG-SDK] missing error in response by @rafaelsideguide in #796
examples: sales web crawler by @rishi-raj-jain in #797
feat: clear ACUC cache endpoint based on team ID by @mogery in #807
feat: skipTlsVerification by @tomkosm in #808
feat: Batch Scrape by @mogery in #789
feat: Auto Recharge Credits + Credit Packs by @nickscamara in #809
Remove ph logs for single_urls by @nickscamara in #829
Bump to gemini-1.5-pro-002 website_qa_with_gemini_caching.ipynb and add flash example by @s-smits in #739
Add SearchApi as a Web Search Tool by @SebastjanPrachovskij in #628
RM wait before interacting by @nickscamara in #838
chore(README.md): use satisfies instead of as for ts example by @twlite in #831
Geo-location rename to location by @nickscamara in #830
concurrency limit fix by @mogery in #824
[feat] Iframe support by @tomkosm in #855
Fix go parser by @tomkosm in #856
Support for the 2 new actions by @nickscamara in #858
Adds support for mobile web scraping + mobile screenshot by @nickscamara in #847
[Feat] Added remove base64 images options (true by default) by @rafaelsideguide in #867
[Fix] Prevent Python Firecrawl logger from interfering with loggers in client applications by @reasonmethis in #613
[BUG] Added trycatch and removed redundancy by @rafaelsideguide in #869
Update CONTRIBUTING.md by @swyxio in https://github.com/mendableai/firecrawl/p...

Contributors

shige, anjor, and 26 other contributors

05 Sep 20:28

nickscamara

v1.0.0

554a050

Welcome to v1 - A more reliable and developer friendly API

Firecrawl v1 is here! With it, we are introducing a more reliable and developer-friendly API.

August 29th, 2024

Here is what’s new:

Output Formats for /scrape. Choose what formats you want your output in.
New /map endpoint for getting most of the URLs of a website.
Developer friendly API for /crawl/{id} status.
2x Rate Limits for all plans.
Go SDK and Rust SDK
Teams support
API Key Management in the dashboard.
onlyMainContent now defaults to true.
/crawl webhooks and websocket support.
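
As a minimal sketch of how output formats and the new onlyMainContent default might look in a /scrape request, the Python snippet below builds two request bodies without sending them. onlyMainContent comes from the list above; the url and formats field names, the "html" format value, and the helper build_scrape_request are assumptions for illustration.

```python
from typing import Any, Dict, List


def build_scrape_request(url: str, formats: List[str], main_content_only: bool = True) -> Dict[str, Any]:
    """Build a /scrape body that names the desired output formats.

    onlyMainContent defaults to true in v1, so it is only sent when the
    caller wants the full page. The "url" and "formats" field names are
    assumptions made for this sketch.
    """
    if not formats:
        raise ValueError("choose at least one output format")
    body: Dict[str, Any] = {"url": url, "formats": list(formats)}
    if not main_content_only:
        body["onlyMainContent"] = False
    return body


# Relies on the v1 default: main content only, markdown output.
default_body = build_scrape_request("https://example.com", ["markdown"])

# Explicitly opts out of the default to keep headers, navs, and footers.
full_page_body = build_scrape_request(
    "https://example.com", ["markdown", "html"], main_content_only=False
)

print(default_body)
print(full_page_body)
```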

Learn more about it here

Start using v1 right away at https://firecrawl.dev

What's Changed (including v0 + v1)

Delete .DS_Store by @szepeviktor in #8
[Bugfix] added normalized apikey to craw/status route by @rafaelsideguide in #12
[Feat] improving reative paths by @rafaelsideguide in #4
Fix typos by @szepeviktor in #9
[Feat] Added html to markdown table parser by @rafaelsideguide in #11
Option to extract only the main content, excluding headers, navs, footers etc. by @nickscamara in #14
[Feat] Adding pdf parser by @rafaelsideguide in #17
adding ci-cd workflow by @rafaelsideguide in #20
adding workflow by @rafaelsideguide in #21
adding env secrets by @rafaelsideguide in #22
[Feat] Added TSDocs and types for js-sdk by @rafaelsideguide in #28
Added option to replace all relative paths with absolute paths by @rafaelsideguide in #25
[Bugfix] Fixed scrape preview test by @rafaelsideguide in #30
Caleb: fixing some documentation and rebuilding the server by @calebpeffer in #32
Rate limit fixes for crawl status by @nickscamara in #36
Better logging by @nickscamara in #35
[Feat] Added type declarations by @rafaelsideguide in #31
Refactor api routes by @nickscamara in #37
Logging by @nickscamara in #38
Cjp/making db auth optional <> Running project locally by @calebpeffer in #40
chore: add context.close by @mattzcarey in #46
Fixes table parsing for websites such as news.ycombinator.com (HN) by @nickscamara in #52
[Feat] Server health check + slack message by @rafaelsideguide in #53
[Feat] Added blocklist for social media urls by @rafaelsideguide in #55
[Feat:mvp] Search Endpoint => serp api + firecrawl => 🔥 🔍 by @nickscamara in #56
[Feat] Added anthropic vision api by @rafaelsideguide in #5
[Bugfix] Trim and Lowercase all urls by @rafaelsideguide in #13
Implements the ability for the crawler to output all the links it found, without scraping by @nickscamara in #34
Serper params by @nickscamara in #62
Support for tbs, filter, lang, country and location with Serper search. by @rogerserper in #61
[Feat] Added allowed urls by @rafaelsideguide in #64
/search support in node sdk by @nickscamara in #72
Free credits increase by @nickscamara in #75
[Bugfix] JS-SDK: Remove dotenv and add tests by @mdp in #68
[Feat] Coupon system by @rafaelsideguide in #66
Specific website params support by @nickscamara in #83
Greenpay fixes by @nickscamara in #84
[Feat] Implemented retry attempts to handle 502 errors by @rafaelsideguide in #67
feat: LLM Extraction (mvp) by @nickscamara in #90
Update README.md by @bllchmbrs in #110
Add Posthog Logging by @ericciarla in #109
Refactor of main web scraper + Partial data streaming by @nickscamara in #120
[Feat] Added includeHTML option by @rafaelsideguide in #126
Cancel Job Route by @nickscamara in #129
[Feat] Added max depth option by @rafaelsideguide in #130
Add keyAuth endpoint by @ericciarla in #131
[Test] Added integration tests suite by @rafaelsideguide in #118
Adds Zod Integration for LLM Extraction in the Firecrawl JS SDK by @nickscamara in #135
[Docs] Updated examples by @rafaelsideguide in #137
Switching to AGPL - We Need Your Consent! by @calebpeffer in #134
Nsc/refactor scraping order by @nickscamara in #139
Update models.ts by @ericciarla in #144
Timeout on /scrape by @nickscamara in #145
[Doc] Added default value for crawlOptions.limit by @rafaelsideguide in #142
feat: 4x-5x faster crawler (fast mode) by @nickscamara in #149
Add Docker Compose for easy self hosting by @chand1012 in #119
refactor: fix typo in WebScraper/index.ts by @eltociear in #27
[Tests] Added crawl test suite -> crawl improvements by @rafaelsideguide in #153
feat: Docx Support by @nickscamara in #158
Fixes pdfs not found if .pdf is not present by @nickscamara in #29
Update README.md: Typo fix by @elimisteve in #160
[Feat] Added rate limits by @rafaelsideguide in #151
Allow override of API URL by @mattjoyce in #166
feat: HyperDX Integration by @nickscamara in #167
beta: Fire-Engine fallback by @nickscamara in #174
Add additional file extensions to crawler.ts by @tractorjuice in #77
[Bug] Fixing /crawl limit by @rafaelsideguide in #143
Update issue templates by @rafaelsideguide in #180
[Feat] Added proxy and media blocking support for Playwright by @JakobStadlhuber in #181
update: wait until body attached in playwright-service by @qyou in #170
feat: Allow privacy/legal/ other pages in social media websites by @nickscamara in #168
[Bug] Added data check for python SDK by @rafaelsideguide in #176
Fix FIRECRAWL_API_URL bug, also various PyLint fixes by @mattjoyce in #178
[Feat] Added idempotency key to crawl route by @rafaelsideguide in #132
Feat: Provide more details for 429 error msg by @simonha9 in #190
Limit on /search is not deterministic by @Keredu in #186
Various PyPi Metadata by @mattjoyce in #191
[Test] Added sdk e2e tests by @rafaelsideguide in #183
Allow users to manually set the waitFor param on /scrape by @nickscamara in #200
[Feat] Added custom scraping conditions for readme docs by @rafaelsideguide in #204
Feat/screenshot support by @ericciarla in #207
feat: New pricing/limits changes by @nickscamara in #216
[sdk] Fixes waiting status not being present on check status by @nickscamara in #218
Fixed fire-engine content bug by @rafaelsideguide in #228
Use @ instead of # for default BULL_AUTH_KEY. Hash mark is reserved...