When you upload a document or add a URL, Opentrace processes it through a multi-stage ingestion pipeline that transforms raw content into searchable, AI-ready chunks.
S3 / ScrapingBee
│
▼
1. Partition — Unstructured (hi_res): extracts text, tables, and images
│
▼
2. Chunk — chunk_by_title (max 3,000 chars, soft 2,400, merge < 500)
│
▼
3. Summarise — GPT-4o generates search-optimised summaries for
table/image chunks; plain text passes through unchanged
│
▼
4. Embed — text-embedding-3-large (1,536 dims), batched in groups of 10
│
▼
5. Store — Persist chunks + embeddings to the document_chunks table

The document is parsed using the Unstructured library with the hi_res strategy. This extracts:

- Text (titles, headings, and body paragraphs)
- Tables
- Images
For PDFs, this includes OCR-based extraction for scanned documents. For web URLs, the HTML is first crawled via ScrapingBee, then partitioned.
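As a rough sketch, the hi_res partition call might look like the following. The filename and flags here are illustrative, and the exact options vary between Unstructured versions:

```python
from unstructured.partition.pdf import partition_pdf

# "hi_res" runs layout detection (plus OCR for scanned pages) and
# yields typed elements: Title, NarrativeText, Table, Image, etc.
elements = partition_pdf(
    filename="report.pdf",       # illustrative path, not a real file
    strategy="hi_res",
    infer_table_structure=True,  # keep table structure in element metadata
)

for element in elements[:5]:
    print(type(element).__name__, ":", str(element)[:60])
```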
Extracted elements are grouped into chunks using chunk_by_title, which creates semantically coherent groupings:
| Parameter | Value | Purpose |
|---|---|---|
| max_characters | 3,000 | Hard limit — never exceed this per chunk |
| new_after_n_chars | 2,400 | Soft limit — prefer to start a new chunk after this |
| combine_text_under_n_chars | 500 | Merge tiny chunks under this size with neighbours |
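In code, this stage is a single chunk_by_title call over the partitioned elements. A minimal sketch using the parameters from the table above:

```python
from unstructured.chunking.title import chunk_by_title

# `elements` is the output of the partition step above.
chunks = chunk_by_title(
    elements,
    max_characters=3000,             # hard limit per chunk
    new_after_n_chars=2400,          # soft limit: start a new chunk past this
    combine_text_under_n_chars=500,  # merge tiny chunks with neighbours
)
```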
Not every chunk needs AI summarisation. Plain text chunks pass through unchanged, since their content is already directly searchable.
Chunks containing tables or images are sent to GPT-4o, which generates a search-optimised text summary. This ensures that visual content (charts, data tables) becomes discoverable through text-based search.
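A minimal sketch of what the table-summarisation call could look like, assuming the standard OpenAI Python client. The prompt wording is an assumption (the actual prompt isn't shown here), and image chunks would instead be sent through GPT-4o's vision input:

```python
from openai import OpenAI

client = OpenAI()

def summarise_table(table_html: str) -> str:
    # Prompt text is illustrative, not Opentrace's actual prompt.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": ("Summarise this table for search. Mention the "
                         "column names, units, and notable values.")},
            {"role": "user", "content": table_html},
        ],
    )
    return response.choices[0].message.content
```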
The summarisation step writes its progress to the database after each chunk (“Processing chunk 3 of 12…”), which the frontend polls to show live progress in the UI.
Each chunk's content (or its AI summary) is embedded using OpenAI's text-embedding-3-large model, producing a 1,536-dimensional vector.
Embeddings are generated in batches of 10 with exponential backoff retry (up to 3 attempts) to handle API rate limits gracefully.
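A sketch of the batching-and-retry loop, again assuming the standard OpenAI Python client. Note that text-embedding-3-large defaults to 3,072 dimensions, so a 1,536-dim vector implies passing dimensions=1536:

```python
import time

import openai
from openai import OpenAI

client = OpenAI()

def embed_texts(texts: list[str], batch_size: int = 10) -> list[list[float]]:
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(3):  # up to 3 attempts per batch
            try:
                response = client.embeddings.create(
                    model="text-embedding-3-large",
                    input=batch,
                    dimensions=1536,  # truncate from the 3,072-dim default
                )
                vectors.extend(item.embedding for item in response.data)
                break
            except openai.RateLimitError:
                if attempt == 2:
                    raise
                time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s
    return vectors
```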
Chunks and their embeddings are stored in the document_chunks table in Supabase (PostgreSQL + pgvector). Each chunk stores:
- content — the searchable text (or AI summary)
- original_content — the original text, tables, and images for display
- embedding — 1,536-dim vector for similarity search
- fts — auto-generated tsvector column for full-text keyword search
- page_number, chunk_index — metadata for citations

The entire pipeline runs asynchronously via Celery (backed by Redis). When a document is uploaded, a Celery task is queued immediately, freeing the API to respond. The processing status is written to the database after each stage, allowing the frontend to poll for progress.
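A sketch of how that orchestration might be wired. The helper functions (set_status, run_partition, and so on) and the broker URL are hypothetical stand-ins for Opentrace's internals; only the Celery-on-Redis architecture and per-stage status writes are stated above:

```python
from celery import Celery

app = Celery("opentrace", broker="redis://localhost:6379/0")  # URL assumed

@app.task
def ingest_document(document_id: str) -> None:
    """Run the five pipeline stages, recording status after each one."""
    # set_status, run_partition, etc. are hypothetical helpers that
    # wrap the steps shown earlier and write status to the database.
    set_status(document_id, "partitioning")
    elements = run_partition(document_id)

    set_status(document_id, "chunking")
    chunks = run_chunking(elements)

    set_status(document_id, "summarising")
    chunks = run_summarisation(document_id, chunks)

    set_status(document_id, "embedding")
    embeddings = run_embedding(chunks)

    set_status(document_id, "storing")
    store_chunks(document_id, chunks, embeddings)  # insert into document_chunks

    set_status(document_id, "complete")
```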