Website Scraping

Detailed website operator routes, crawl behavior, HTTP methods, and SSE flows.

This route family is mounted under /api/operators/website/*.

Choose the lightest route first

Start with html for simple pages, move to html-js only when rendering is required, and use crawl only when the task is truly multi-page and queue-backed. That keeps latency, browser cost, and operational complexity lower.

Endpoint summary

Method	Path	Purpose	Notes
`POST`	`/api/operators/website/scrape/stream`	SSE website scrape	primary streaming scrape route
`POST`	`/api/operators/website/map`	discover links quickly	non-streaming
`POST`	`/api/operators/website/map/stream`	stream site discovery progress	SSE
`POST`	`/api/operators/website/crawl`	queue-backed crawl job	requires Redis and worker
`POST`	`/api/operators/website/scrape/html`	fast HTML scrape	no JS rendering
`POST`	`/api/operators/website/scrape/html-js`	JS-rendered HTML scrape	browser-rendered
`POST`	`/api/operators/website/scrape/content`	markdown content extraction	uses markdown service when configured
`POST`	`/api/operators/website/scrape/screenshot`	full-page screenshot	binary image result

POST /api/operators/website/scrape/html

Use this for fast HTML retrieval when you do not need JavaScript rendering.

Good for:

static pages
quick validation
lower-cost extraction

POST /api/operators/website/scrape/html-js

Use this when the target page requires browser rendering.

Good for:

JS-heavy sites
pages that require wait conditions
rendered DOM extraction before content conversion

POST /api/operators/website/scrape/content

Returns markdown-oriented content extraction.

Notes:

uses HTML_TO_MARKDOWN_SERVICE_URL when available
can fall back locally when the external markdown service is unavailable

POST /api/operators/website/scrape/screenshot

Returns a binary image response, not a JSON payload.

Use this when:

you need a visual capture
you want a full-page or debugging artifact

POST /api/operators/website/map

Use map when you want quick link discovery rather than full extraction.

POST /api/operators/website/map/stream

Use the streaming variant when you want progressive discovery updates.

POST /api/operators/website/scrape/stream

This route streams progress via SSE.

Common events:

start
progress
result
error
done

POST /api/operators/website/crawl

This is the queued crawl workflow.

Important:

it is not an inline page scrape
it requires Redis
it requires the worker process
results are tracked through the jobs system

Crawl is asynchronous

POST /api/operators/website/crawl hands work to the queue and returns job-oriented state. If Redis or the worker is unavailable, crawl will not behave like the direct scrape endpoints.

Crawl use cases

site-wide collection
async crawling
job tracking and cancellation
reconnectable progress behavior through /api/jobs

Recommended path

try POST /api/operators/website/scrape/html
switch to POST /api/operators/website/scrape/html-js if rendering is required
use POST /api/operators/website/scrape/content when markdown is the real output
use POST /api/operators/website/crawl when the task is queue-backed and multi-page

Troubleshooting

Was this page helpful?

Google AI Search

Google AI Search HTTP and SSE routes for search extraction, cookie bootstrap, targeting controls, and status checks.

Get Started

Self-Hosting

Developers

API Reference

Changelog

Website Scraping

Endpoint summary

POST /api/operators/website/scrape/html

POST /api/operators/website/scrape/html-js

POST /api/operators/website/scrape/content

POST /api/operators/website/scrape/screenshot

POST /api/operators/website/map

POST /api/operators/website/map/stream

POST /api/operators/website/scrape/stream

POST /api/operators/website/crawl

Crawl use cases

Recommended path

Troubleshooting

Website Scraping

Endpoint summary

POST /api/operators/website/scrape/html

POST /api/operators/website/scrape/html-js

POST /api/operators/website/scrape/content

POST /api/operators/website/scrape/screenshot

POST /api/operators/website/map

POST /api/operators/website/map/stream

POST /api/operators/website/scrape/stream

POST /api/operators/website/crawl

Crawl use cases

Recommended path

Troubleshooting

html works but html-js is slow or unstable

crawl returns job problems instead of page content

content extraction does not return expected markdown