
Ingestion Pipeline

When you upload a document or add a URL, Opentrace processes it through a multi-stage ingestion pipeline that transforms raw content into searchable, AI-ready chunks.

Pipeline Overview

```text
S3 / ScrapingBee
      │
      ▼
1. Partition   — Unstructured (hi_res): extracts text, tables, and images
      │
      ▼
2. Chunk       — chunk_by_title (max 3,000 chars, soft 2,400, merge < 500)
      │
      ▼
3. Summarise   — GPT-4o generates search-optimised summaries for
                 table/image chunks; plain text passes through unchanged
      │
      ▼
4. Embed       — text-embedding-3-large (1,536 dims), batched in groups of 10
      │
      ▼
5. Store       — Persist chunks + embeddings to document_chunks table
```

Stage 1: Partitioning

The document is parsed using the Unstructured library with the hi_res strategy. This extracts:

  • Text elements — paragraphs, headings, list items
  • Tables — HTML table representations
  • Images — base64-encoded image data

For PDFs, this includes OCR-based extraction for scanned documents. For web URLs, the page's HTML is first fetched via ScrapingBee, then partitioned.
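In code, the partitioning call looks roughly like this. A minimal sketch using the Unstructured Python API; the filename is a placeholder and the exact option set is an assumption, not Opentrace's verbatim configuration:

```python
from unstructured.partition.pdf import partition_pdf

# hi_res uses layout detection plus OCR, so scanned pages are handled too
elements = partition_pdf(
    filename="report.pdf",                # placeholder path
    strategy="hi_res",
    infer_table_structure=True,           # tables arrive with metadata.text_as_html
    extract_image_block_types=["Image"],  # keep embedded images
    extract_image_block_to_payload=True,  # base64 data lands in metadata.image_base64
)

for el in elements:
    if el.category == "Table":
        table_html = el.metadata.text_as_html  # HTML table representation
    elif el.category == "Image":
        image_b64 = el.metadata.image_base64   # base64-encoded image data
```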

Stage 2: Chunking

Extracted elements are grouped into chunks using chunk_by_title, which creates semantically coherent groupings:

Parameter                     Value   Purpose
max_characters                3,000   Hard limit — never exceed this per chunk
new_after_n_chars             2,400   Soft limit — prefer to start a new chunk after this
combine_text_under_n_chars    500     Merge tiny chunks under this size with neighbours
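These map directly onto chunk_by_title's keyword arguments. A sketch, assuming elements is the list returned by the partitioning stage:

```python
from unstructured.chunking.title import chunk_by_title

chunks = chunk_by_title(
    elements,
    max_characters=3000,             # hard limit: never exceeded
    new_after_n_chars=2400,          # soft limit: prefer a new chunk past this point
    combine_text_under_n_chars=500,  # merge tiny chunks with their neighbours
)
```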

Stage 3: Summarisation

Not every chunk needs AI summarisation. Plain text chunks pass through unchanged — they're already semantic and searchable.

Chunks containing tables or images are sent to GPT-4o, which generates a search-optimised text summary. This ensures that visual content (charts, data tables) becomes discoverable through text-based search.
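A minimal sketch of the table path using the OpenAI Python client. The prompt wording is an assumption, not Opentrace's actual prompt; image chunks can go through the same endpoint by sending an image_url content part carrying a base64 data URL:

```python
from openai import OpenAI

client = OpenAI()

def summarise_table(table_html: str) -> str:
    # Prompt text is illustrative; the goal is a summary optimised for search.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Summarise this table for semantic search: name the columns "
                        "and call out key figures and trends."},
            {"role": "user", "content": table_html},
        ],
    )
    return response.choices[0].message.content
```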

Note

The summarisation step writes per-chunk progress to the database (“Processing chunk 3 of 12…”), which the frontend polls so users see live progress in the UI.

Stage 4: Embedding

Each chunk's content (or its AI summary) is embedded using OpenAI's text-embedding-3-large model, producing a 1,536-dimensional vector.

Embeddings are generated in batches of 10 with exponential backoff retry (up to 3 attempts) to handle API rate limits gracefully.
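A sketch of that batching and retry logic. Note that text-embedding-3-large is natively 3,072-dimensional, so a 1,536-dim output presumably means the pipeline passes the API's dimensions parameter, as shown here; the surrounding loop and variable names are illustrative:

```python
import time

from openai import OpenAI, RateLimitError

client = OpenAI()
BATCH_SIZE = 10

def embed_batch(texts: list[str]) -> list[list[float]]:
    for attempt in range(3):  # up to 3 attempts
        try:
            resp = client.embeddings.create(
                model="text-embedding-3-large",
                input=texts,
                dimensions=1536,  # request the reduced 1,536-dim output
            )
            return [d.embedding for d in resp.data]
        except RateLimitError:
            if attempt == 2:
                raise
            time.sleep(2 ** attempt)  # exponential backoff: 1s, then 2s

# texts_to_embed: chunk contents (or AI summaries) from stage 3
vectors: list[list[float]] = []
for i in range(0, len(texts_to_embed), BATCH_SIZE):
    vectors += embed_batch(texts_to_embed[i:i + BATCH_SIZE])
```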

Stage 5: Storage

Chunks and their embeddings are stored in the document_chunks table in Supabase (PostgreSQL + pgvector). Each chunk stores:

  • content — the searchable text (or AI summary)
  • original_content — original text, tables, and images for display
  • embedding — 1,536-dim vector for similarity search
  • fts — auto-generated tsvector column for full-text keyword search
  • page_number, chunk_index — metadata for citations
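With the Python Supabase client, the insert might look like this. A sketch: the helper, the document_id column, and the dict shape are assumptions based on the fields above; fts is generated by Postgres itself, so the client never writes it:

```python
from supabase import Client

def store_chunk(db: Client, document_id: str, chunk: dict) -> None:
    # Hypothetical helper; column names follow the list above.
    db.table("document_chunks").insert({
        "document_id": document_id,
        "content": chunk["content"],                    # searchable text or AI summary
        "original_content": chunk["original_content"],  # raw text / table HTML / images
        "embedding": chunk["embedding"],                # list of 1,536 floats (pgvector)
        "page_number": chunk["page_number"],
        "chunk_index": chunk["chunk_index"],
        # fts is an auto-generated tsvector column, so it is omitted here
    }).execute()
```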

Processing as a Background Task

The entire pipeline runs asynchronously via Celery (backed by Redis). When a document is uploaded, a Celery task is queued immediately, freeing the API to respond. The processing status is written to the database after each stage, allowing the frontend to poll for progress.
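As a sketch of that wiring (the broker URL is a placeholder and the two helpers are hypothetical; only the Celery pattern itself is the point):

```python
from celery import Celery

app = Celery("opentrace", broker="redis://localhost:6379/0")  # placeholder broker URL

@app.task
def process_document(document_id: str) -> None:
    for stage in ("partition", "chunk", "summarise", "embed", "store"):
        run_stage(document_id, stage)             # hypothetical: runs one pipeline stage
        write_status(document_id, stage, "done")  # hypothetical: row the frontend polls

# The upload endpoint enqueues the task and returns without waiting:
# process_document.delay(document_id)
```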
