Website Scraping
Detailed website operator routes, crawl behavior, HTTP methods, and SSE flows.
by @saifyxpro
This route family is mounted under /api/operators/website/*.
Start with html for simple pages, move to html-js only when rendering is required, and use crawl only when the task is truly multi-page and queue-backed. That keeps latency, browser cost, and operational complexity lower.
Endpoint summary
| Method | Path | Purpose | Notes |
|---|---|---|---|
POST | /api/operators/website/scrape/stream | SSE website scrape | primary streaming scrape route |
POST | /api/operators/website/map | discover links quickly | non-streaming |
POST | /api/operators/website/map/stream | stream site discovery progress | SSE |
POST | /api/operators/website/crawl | queue-backed crawl job | requires Redis and worker |
POST | /api/operators/website/scrape/html | fast HTML scrape | no JS rendering |
POST | /api/operators/website/scrape/html-js | JS-rendered HTML scrape | browser-rendered |
POST | /api/operators/website/scrape/content | markdown content extraction | uses markdown service when configured |
POST | /api/operators/website/scrape/screenshot | full-page screenshot | binary image result |
POST /api/operators/website/scrape/html
Use this for fast HTML retrieval when you do not need JavaScript rendering.
Good for:
- static pages
- quick validation
- lower-cost extraction
POST /api/operators/website/scrape/html-js
Use this when the target page requires browser rendering.
Good for:
- JS-heavy sites
- pages that require wait conditions
- rendered DOM extraction before content conversion
POST /api/operators/website/scrape/content
Returns markdown-oriented content extraction.
Notes:
- uses
HTML_TO_MARKDOWN_SERVICE_URLwhen available - can fall back locally when the external markdown service is unavailable
POST /api/operators/website/scrape/screenshot
Returns a binary image response, not a JSON payload.
Use this when:
- you need a visual capture
- you want a full-page or debugging artifact
POST /api/operators/website/map
Use map when you want quick link discovery rather than full extraction.
POST /api/operators/website/map/stream
Use the streaming variant when you want progressive discovery updates.
POST /api/operators/website/scrape/stream
This route streams progress via SSE.
Common events:
startprogressresulterrordone
POST /api/operators/website/crawl
This is the queued crawl workflow.
Important:
- it is not an inline page scrape
- it requires Redis
- it requires the worker process
- results are tracked through the jobs system
POST /api/operators/website/crawl hands work to the queue and returns job-oriented state. If Redis or the worker is unavailable, crawl will not behave like the direct scrape endpoints.
Crawl use cases
- site-wide collection
- async crawling
- job tracking and cancellation
- reconnectable progress behavior through
/api/jobs
Recommended path
- try
POST /api/operators/website/scrape/html - switch to
POST /api/operators/website/scrape/html-jsif rendering is required - use
POST /api/operators/website/scrape/contentwhen markdown is the real output - use
POST /api/operators/website/crawlwhen the task is queue-backed and multi-page