Website Scraping

Detailed website operator routes, crawl behavior, HTTP methods, and SSE flows.

by @saifyxpro

This route family is mounted under /api/operators/website/*.

Choose the lightest route first

Start with html for simple pages, move to html-js only when rendering is required, and use crawl only when the task is truly multi-page and queue-backed. That keeps latency, browser cost, and operational complexity lower.

Endpoint summary


Method  Path                                      Purpose                         Notes
POST    /api/operators/website/scrape/stream      SSE website scrape              primary streaming scrape route
POST    /api/operators/website/map                discover links quickly          non-streaming
POST    /api/operators/website/map/stream         stream site discovery progress  SSE
POST    /api/operators/website/crawl              queue-backed crawl job          requires Redis and worker
POST    /api/operators/website/scrape/html        fast HTML scrape                no JS rendering
POST    /api/operators/website/scrape/html-js     JS-rendered HTML scrape         browser-rendered
POST    /api/operators/website/scrape/content     markdown content extraction     uses markdown service when configured
POST    /api/operators/website/scrape/screenshot  full-page screenshot            binary image result

POST /api/operators/website/scrape/html


Use this for fast HTML retrieval when you do not need JavaScript rendering.

Good for:

  • static pages
  • quick validation
  • lower-cost extraction
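
A minimal sketch of building a request for this route with Python's standard library. The base URL http://localhost:3000, the "url" body field, and the helper name build_scrape_request are assumptions for illustration; this page does not document the request schema, so confirm it against the actual API.

```python
import json
import urllib.request

# Assumption: replace with the origin of your own deployment.
BASE_URL = "http://localhost:3000"


def build_scrape_request(route: str, target_url: str) -> urllib.request.Request:
    """Build a POST request for a website operator route such as 'scrape/html'."""
    # Assumption: the route accepts a JSON body with a "url" field.
    body = json.dumps({"url": target_url}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/operators/website/{route}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to urllib.request.urlopen sends the request. Because the route is a parameter, the same helper can target scrape/html-js later if the page turns out to need rendering.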

POST /api/operators/website/scrape/html-js


Use this when the target page requires browser rendering.

Good for:

  • JS-heavy sites
  • pages that require wait conditions
  • rendered DOM extraction before content conversion

POST /api/operators/website/scrape/content


Returns markdown-oriented content extraction.

Notes:

  • uses HTML_TO_MARKDOWN_SERVICE_URL when available
  • can fall back locally when the external markdown service is unavailable

POST /api/operators/website/scrape/screenshot


Returns a binary image response, not a JSON payload.

Use this when:

  • you need a visual capture
  • you want a full-page capture or a debugging artifact
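
Because the response body is raw image bytes, a client should write it to disk rather than parse it. The sketch below is one way to do that; the helper name save_screenshot is hypothetical, and treating a JSON content type as an error payload is an assumption, not documented behavior.

```python
from pathlib import Path


def save_screenshot(body: bytes, content_type: str, dest: Path) -> Path:
    """Write a binary screenshot response to disk, refusing JSON bodies."""
    media_type = content_type.split(";")[0].strip().lower()
    # Assumption: a JSON body on this route signals an error, not an image.
    if media_type == "application/json":
        preview = body[:200].decode("utf-8", "replace")
        raise ValueError(f"expected image bytes, got JSON: {preview}")
    dest.write_bytes(body)
    return dest
```

The image format is not stated here, so choose the file extension from the Content-Type header the server actually returns.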

POST /api/operators/website/map


Use map when you want quick link discovery rather than full extraction.

POST /api/operators/website/map/stream


Use the streaming variant when you want progressive discovery updates.

POST /api/operators/website/scrape/stream


This route streams progress via SSE.

Common events:

  • start
  • progress
  • result
  • error
  • done
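
The events above arrive in the standard text/event-stream framing, so a small parser is enough to consume them. This is a minimal sketch that handles only the event and data fields; the helper name parse_sse is hypothetical, and the shape of each event's data payload is not specified on this page.

```python
from typing import Iterable, Iterator, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event, data) pairs from text/event-stream lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates one event; dispatch it if it carried data.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # the format strips one leading space
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)
```

A client would typically stop reading once it sees done, and treat error as terminal for the current request.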

POST /api/operators/website/crawl


This is the queued crawl workflow.

Important:

  • it is not an inline page scrape
  • it requires Redis
  • it requires the worker process
  • results are tracked through the jobs system

Crawl is asynchronous

POST /api/operators/website/crawl hands work to the queue and returns job-oriented state. If Redis or the worker is unavailable, crawl will not behave like the direct scrape endpoints.
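
Since crawl returns job state rather than a finished result, a client usually polls until the job settles. The sketch below shows that loop with the HTTP call injected as fetch_status, so no endpoint shape is implied. The "status" field, the terminal state names, and the helper name wait_for_job are all assumptions; check the jobs API for the real values.

```python
import time
from typing import Callable, Dict

# Assumption: illustrative status names, not taken from the API.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def wait_for_job(
    fetch_status: Callable[[str], Dict],
    job_id: str,
    interval: float = 2.0,
    max_polls: int = 150,
) -> Dict:
    """Poll a job until it reaches a terminal state or the poll budget runs out."""
    for attempt in range(max_polls):
        job = fetch_status(job_id)
        if job.get("status") in TERMINAL_STATES:
            return job
        if attempt < max_polls - 1:
            time.sleep(interval)  # wait before the next poll
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")
```

Bounding the number of polls keeps a client from waiting forever when Redis or the worker is down and the job never advances.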

Crawl use cases

  • site-wide collection
  • async crawling
  • job tracking and cancellation
  • reconnectable progress behavior through /api/jobs

In practice, escalate through the routes in this order:

  1. try POST /api/operators/website/scrape/html
  2. switch to POST /api/operators/website/scrape/html-js if rendering is required
  3. use POST /api/operators/website/scrape/content when markdown is the real output
  4. use POST /api/operators/website/crawl when the task is queue-backed and multi-page
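
The escalation order above can be sketched as a small selection function. The function name choose_route and its flags are hypothetical, and the precedence between the markdown and JS-rendering cases is an assumption, since this page does not say which wins when both apply.

```python
def choose_route(
    needs_js: bool = False,
    wants_markdown: bool = False,
    multi_page: bool = False,
) -> str:
    """Return the lightest website operator route that satisfies the task."""
    prefix = "/api/operators/website"
    if multi_page:
        return f"{prefix}/crawl"  # queue-backed; needs Redis and the worker
    if wants_markdown:
        return f"{prefix}/scrape/content"  # markdown is the real output
    if needs_js:
        return f"{prefix}/scrape/html-js"  # browser rendering required
    return f"{prefix}/scrape/html"  # default: fastest and cheapest
```

With no flags set, the function falls through to scrape/html, which matches the advice to start with the lightest route.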

Troubleshooting