
Document Processing

After uploading a file or adding a URL, your content goes through a multi-stage processing pipeline. This page explains each status and what it means.

Processing Stages

| Stage | Status | Description |
|-------|--------|-------------|
| 1 | uploading | File is being uploaded to S3 storage. For URLs, the page is being crawled. |
| 2 | queued | Upload complete. A Celery background task has been created and is waiting for a worker. |
| 3 | partitioning | The Unstructured library is parsing the document to extract text, tables, and images. |
| 4 | chunking | Extracted content is being split into semantically coherent chunks (max 3,000 characters each). |
| 5 | summarising | Chunks containing tables or images are being summarised by GPT-4o for better search results. |
| 6 | vectorization | Chunks are being converted to 1,536-dimensional embedding vectors for similarity search. |
| 7 | completed | All processing is done. The document is now fully searchable. |
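
To make the 3,000-character limit in the chunking stage concrete, here is a minimal sketch of a size-bounded splitter. It is not the pipeline's actual chunker: it assumes paragraphs are separated by blank lines and packs them greedily, which is a much simpler stand-in for semantic chunking. The names `chunk_paragraphs` and `MAX_CHUNK_CHARS` are hypothetical.

```python
MAX_CHUNK_CHARS = 3000  # limit stated in the stage table


def chunk_paragraphs(text: str, max_chars: int = MAX_CHUNK_CHARS) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks of at most max_chars."""
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        # A single paragraph longer than the limit is hard-split (assumption:
        # the real chunker would pick a smarter boundary).
        while len(para) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate  # paragraph still fits in the open chunk
        else:
            chunks.append(current)  # close the open chunk and start a new one
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Packing whole paragraphs keeps related sentences together where possible, while the hard split guarantees that no chunk exceeds the limit.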

Error States

| Status | Description |
|--------|-------------|
| failed | An error occurred during processing. The error message is stored and can be viewed by clicking the document. |
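
If you handle these status values in your own code, it can help to model them explicitly. The sketch below is one possible representation, assuming the status strings are exactly those listed on this page; the `DocumentStatus` class and the helper functions are hypothetical names, not part of the product's API.

```python
from enum import Enum
from typing import Optional


class DocumentStatus(str, Enum):
    """Status strings as listed on this page, in pipeline order."""
    UPLOADING = "uploading"
    QUEUED = "queued"
    PARTITIONING = "partitioning"
    CHUNKING = "chunking"
    SUMMARISING = "summarising"
    VECTORIZATION = "vectorization"
    COMPLETED = "completed"
    FAILED = "failed"  # error state, not a numbered stage


# Stages 1-7 in order; "failed" is kept separate because it is an error state.
PIPELINE_ORDER = [s for s in DocumentStatus if s is not DocumentStatus.FAILED]


def is_terminal(status: DocumentStatus) -> bool:
    """True once no further status changes are expected."""
    return status in (DocumentStatus.COMPLETED, DocumentStatus.FAILED)


def stage_number(status: DocumentStatus) -> Optional[int]:
    """1-based stage number, or None for the failed state."""
    if status is DocumentStatus.FAILED:
        return None
    return PIPELINE_ORDER.index(status) + 1
```

For example, `stage_number(DocumentStatus("chunking"))` gives the stage number for a status string received from elsewhere, and `is_terminal` tells you when to stop waiting.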

Real-Time Updates

The frontend polls the document status every 2 seconds while processing is active, so the status badge updates automatically without a page refresh.
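
The polling behaviour can be sketched as a simple loop. This is an illustration of the logic only, not the frontend's actual code: `fetch_status` and `on_update` are hypothetical callables you would supply (for instance, an HTTP call and a UI refresh), and the set of terminal statuses is assumed to be `completed` and `failed`.

```python
import time
from typing import Callable

POLL_INTERVAL_SECONDS = 2.0  # interval stated above
TERMINAL_STATUSES = {"completed", "failed"}  # assumed end states


def poll_until_done(
    fetch_status: Callable[[], str],
    on_update: Callable[[str], None],
    interval: float = POLL_INTERVAL_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status until a terminal status is seen, then return it."""
    last = None
    while True:
        status = fetch_status()
        if status != last:
            on_update(status)  # e.g. refresh the status badge
            last = status
        if status in TERMINAL_STATUSES:
            return status
        sleep(interval)
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays.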

During the summarisation stage, the UI shows progress like “Processing chunk 3 of 12…” for granular visibility.

Retry Behaviour

If the embedding step fails (e.g., due to OpenAI rate limits), the pipeline retries with exponential backoff:

  • Wait 1 second, then retry
  • Wait 2 seconds, then retry
  • Wait 4 seconds, then retry
  • After 3 failed attempts, mark the document as failed
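
The schedule above can be sketched as follows. This is a minimal illustration rather than the pipeline's implementation, and it assumes one reading of the list: an initial attempt followed by up to three retries after waits of 1, 2, and 4 seconds. `embed_with_retry` and its `embed` argument are hypothetical names; `embed` stands in for whatever call produces the embeddings.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

BACKOFF_DELAYS = (1.0, 2.0, 4.0)  # seconds, as listed above


def embed_with_retry(
    embed: Callable[[], T],
    delays: tuple[float, ...] = BACKOFF_DELAYS,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call embed; on failure wait and retry, re-raising once delays run out."""
    try:
        return embed()
    except Exception as exc:
        last_error = exc
    for delay in delays:
        sleep(delay)  # wait 1 s, then 2 s, then 4 s
        try:
            return embed()
        except Exception as exc:
            last_error = exc
    # All retries failed; the caller would mark the document as failed here.
    raise last_error
```

Doubling the wait after each failure gives a rate-limited service progressively more time to recover before the next attempt.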