Document Processing

After uploading a file or adding a URL, your content goes through a multi-stage processing pipeline. This page explains each status and what it means.

Processing Stages

Stage	Status	Description
1	`uploading`	File is being uploaded to S3 storage. For URLs, the page is being crawled.
2	`queued`	Upload complete. A Celery background task has been created and is waiting for a worker.
3	`partitioning`	The Unstructured library is parsing the document to extract text, tables, and images.
4	`chunking`	Extracted content is being split into semantically coherent chunks (max 3,000 characters each).
5	`summarising`	Chunks containing tables or images are being summarised by GPT-4o for better search results.
6	`vectorization`	Chunks are being converted to 1,536-dimensional embedding vectors for similarity search.
7	`completed`	All processing is done. The document is now fully searchable.
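
For code that consumes these statuses, the table can be modelled as a small enumeration. The following is a minimal sketch, not the product's actual data model: the names `DocumentStatus`, `PIPELINE_ORDER`, `is_terminal`, and `stage_number` are hypothetical, and only the status strings themselves come from this page.

```python
from enum import Enum
from typing import Optional


class DocumentStatus(str, Enum):
    """Status strings as documented on this page (class name is illustrative)."""

    UPLOADING = "uploading"
    QUEUED = "queued"
    PARTITIONING = "partitioning"
    CHUNKING = "chunking"
    SUMMARISING = "summarising"
    VECTORIZATION = "vectorization"
    COMPLETED = "completed"
    FAILED = "failed"


# Happy-path order of stages 1 to 7. "failed" is not a numbered stage.
PIPELINE_ORDER = [
    DocumentStatus.UPLOADING,
    DocumentStatus.QUEUED,
    DocumentStatus.PARTITIONING,
    DocumentStatus.CHUNKING,
    DocumentStatus.SUMMARISING,
    DocumentStatus.VECTORIZATION,
    DocumentStatus.COMPLETED,
]

# Statuses after which no further processing happens.
TERMINAL_STATUSES = {DocumentStatus.COMPLETED, DocumentStatus.FAILED}


def is_terminal(status: str) -> bool:
    """Return True if the status means processing has stopped."""
    return DocumentStatus(status) in TERMINAL_STATUSES


def stage_number(status: str) -> Optional[int]:
    """Return the 1-based stage number, or None for 'failed'."""
    parsed = DocumentStatus(status)
    if parsed in PIPELINE_ORDER:
        return PIPELINE_ORDER.index(parsed) + 1
    return None
```

With this sketch, `stage_number("chunking")` gives the stage number from the table, and `DocumentStatus("unknown")` raises `ValueError`, which makes an unexpected status string fail loudly instead of being silently ignored.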

Error States

Status	Description
`failed`	An error occurred during processing. The error message is stored and can be viewed by clicking the document.

Real-Time Updates

The frontend polls the document status every 2 seconds while processing is active, so the status badge updates automatically and no page refresh is needed.
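
A client script could follow the same pattern. The sketch below is illustrative only and is not the frontend's code: `wait_until_done` and its `fetch_status` callable are hypothetical stand-ins for whatever call returns the current status string, and the `max_polls` safety limit is an assumption added so the loop cannot run forever.

```python
import time
from typing import Callable

POLL_INTERVAL_SECONDS = 2.0  # polling interval stated on this page
TERMINAL_STATUSES = {"completed", "failed"}


def wait_until_done(
    fetch_status: Callable[[], str],
    interval: float = POLL_INTERVAL_SECONDS,
    max_polls: int = 300,  # assumed upper bound, not from the docs
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status() until it returns a terminal status, then return it."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        # Processing is still active: wait before asking again.
        sleep(interval)
    raise TimeoutError(f"no terminal status after {max_polls} polls")
```

The `sleep` parameter is injectable so the loop can be exercised in tests without real delays.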

During the summarisation stage, the UI shows progress like “Processing chunk 3 of 12…” for granular visibility.

Retry Behaviour

If the embedding step fails (e.g., due to OpenAI rate limits), the pipeline retries with exponential backoff:

1. Wait 1 second, then retry
2. Wait 2 seconds, then retry
3. Wait 4 seconds, then retry
4. After 3 failed attempts, mark the document as `failed`
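
The schedule above can be sketched as a small retry helper. This is a hedged illustration under stated assumptions, not the pipeline's implementation: `embed_with_backoff` and its `embed` callable are hypothetical names, and the sketch assumes one initial attempt followed by up to three retries. It also catches every `Exception` for brevity, whereas real code would likely retry only on transient errors such as rate limits.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

# Delays before each retry, in seconds: 1, 2, 4 (doubling each time).
BACKOFF_DELAYS = (1.0, 2.0, 4.0)


def embed_with_backoff(
    embed: Callable[[], T],
    delays: Sequence[float] = BACKOFF_DELAYS,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call embed(), retrying after each delay in `delays` on failure.

    Raises RuntimeError once the initial attempt and every retry have
    failed; the caller would then mark the document as failed.
    """
    last_error = None
    # One initial attempt plus one retry per configured delay (assumption).
    for attempt in range(len(delays) + 1):
        try:
            return embed()
        except Exception as exc:  # broad on purpose in this sketch
            last_error = exc
            if attempt < len(delays):
                sleep(delays[attempt])
    raise RuntimeError("embedding failed after all retries") from last_error
```

Keeping the delays in a tuple rather than computing `2 ** attempt` inline makes the schedule easy to read against the list above and easy to change.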
