SPEC-STV-07-RAG-System

SPEC-STV-07 · Spec header

Spec ID: SPEC-STV-07
Title: RAG Knowledge Base
Version: 1.0.0
Status: Planned
Authority: Specification
Priority: P0
Owner role: RAG engineer
Reviewers: Backend architect, Security lead
Last reviewed: 2026-05-11
Sync targets: app/Services/Rag/**, docs/RAG_SYSTEM.md
Depends on: SPEC-STV-HUB, SPEC-STV-02
Consumed by: SPEC-STV-08, SPEC-STV-09
Conflict rule: Hub wins.
Change policy: RAG engineer + Backend architect; Registry bump on backend/embed-model change.

1 · Goal

A workspace-scoped retrieval system that lets the AI answer with citations to the workspace's own pages, blocks, comments, files, templates, and database rows. The feature can be switched ON or OFF per workspace.

2 · Sources indexed

source_type	What is chunked	Excluded
`page`	Title + concatenated visible text from all blocks (excluding archived).	Private pages from other users (respect permissions at query time).
`block`	Block-level chunks for long pages (chunked individually so citations point at a specific block).	`code` blocks where `metadata.secret = true`.
`comment`	Comment body.	Resolved comments older than 30 days (configurable).
`file`	Extracted text from PDFs / Markdown / plain text. Images get an OCR pass (P5+).	Files > 50 MiB. Encrypted files.
`template`	Template name + description + flattened payload text.	—
`db_row`	Concatenation of denormalized `value_text` per row.	Rows in archived databases.

3 · Chunking

Strategy: heading-aware splitter; max 1200 tokens / chunk, 150 token overlap.

Boundary preference: paragraph > sentence > token.

Each chunk stores { workspace_id, source_type, source_id, chunk_index, text, embedding (JSON), content_hash = sha256(text) }.
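The chunking rules above can be sketched as follows. This is a minimal illustration, not the production splitter: it assumes a "token" is a whitespace-delimited word and that paragraphs are separated by blank lines, and it omits heading awareness and the sentence-level fallback. The names chunk_text and make_chunk_records are hypothetical.

```python
import hashlib

MAX_TOKENS = 1200  # per-chunk limit from this section
OVERLAP = 150      # tokens repeated at the start of the next chunk


def chunk_text(text, max_tokens=MAX_TOKENS, overlap=OVERLAP):
    """Split text into chunks, preferring paragraph boundaries.

    Assumption: a "token" is a whitespace-delimited word; a real
    implementation would use the embedding model's tokenizer.
    """
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    chunks, current = [], []
    for para in text.split("\n\n"):
        words = para.split()
        if not words:
            continue
        # Close the current chunk if this paragraph would overflow it.
        if current and len(current) + len(words) > max_tokens:
            chunks.append(current)
            current = current[-overlap:] if overlap else []
        current = current + words
        # Token-level fallback for a paragraph longer than the limit.
        while len(current) > max_tokens:
            chunks.append(current[:max_tokens])
            current = current[max_tokens - overlap:]
    if current:
        chunks.append(current)
    return [" ".join(c) for c in chunks]


def make_chunk_records(workspace_id, source_type, source_id, text):
    """Build one record per chunk with the fields listed in this section."""
    records = []
    for index, chunk in enumerate(chunk_text(text)):
        records.append({
            "workspace_id": workspace_id,
            "source_type": source_type,
            "source_id": source_id,
            "chunk_index": index,
            "text": chunk,
            "embedding": None,  # filled in later by the embedding step
            "content_hash": hashlib.sha256(chunk.encode("utf-8")).hexdigest(),
        })
    return records
```

Hashing each chunk's text separately lets a later indexing pass detect which chunks actually changed.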

4 · Embedding model

Default: text-embedding-3-large (3072 dims). Configurable in rag_settings.embedding_model.

Provider: same OpenAI key surface as AI; a workspace may flip to a self-hosted provider (localai, ollama) via rag_settings.provider.

Changing the embedding model requires re-embedding every chunk through a full reindex job (admin-triggered).

5 · Retrieval

Top-K = 8 default, capped at 20.

Always filter by workspace_id AND by the caller's effective page permissions.

Ranking: score candidates by cosine similarity, optionally re-rank the top 50 with a cross-encoder (post-v1), then return the final top-K.

Hits are returned with { page_uuid, block_id?, score, excerpt }.
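The retrieval rules in this section can be illustrated with a brute-force sketch. It assumes each stored chunk already carries a page_uuid and an optional block_id, that the caller's effective permissions have been resolved into a set of readable page UUIDs, and that an excerpt is simply the first 200 characters of the chunk text. All of these, along with the function names, are assumptions for illustration; the cross-encoder step is omitted.

```python
import math

DEFAULT_TOP_K = 8
MAX_TOP_K = 20


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, chunks, workspace_id, readable_pages, top_k=DEFAULT_TOP_K):
    """Return the best hits the caller is allowed to read.

    readable_pages: set of page UUIDs the caller may read (hypothetical
    input; how permissions are resolved is outside this sketch).
    """
    top_k = max(1, min(top_k, MAX_TOP_K))  # enforce the cap of 20
    hits = []
    for chunk in chunks:
        # Filter before scoring so unreadable content never reaches a hit.
        if chunk["workspace_id"] != workspace_id:
            continue
        if chunk["page_uuid"] not in readable_pages:
            continue
        hits.append({
            "page_uuid": chunk["page_uuid"],
            "block_id": chunk.get("block_id"),
            "score": cosine(query_vec, chunk["embedding"]),
            "excerpt": chunk["text"][:200],
        })
    hits.sort(key=lambda h: h["score"], reverse=True)
    return hits[:top_k]
```

Filtering on workspace and permissions before ranking means the top-K slots are never spent on chunks that would be discarded afterwards.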

6 · Citations format (AI side)

When the AI uses RAG hits, the response includes a citations array. The web client renders inline chips [1], [2], etc., resolving to the source page/block. No citation = no claim allowed (AiAnswerService enforces).
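One way to picture the "no citation = no claim" rule is a check that every [n] marker in an answer resolves to an entry in the citations array. The shape of a citation entry (index, page_uuid, block_id) and the function names below are assumptions for illustration; this spec does not define them, and AiAnswerService may enforce the rule differently.

```python
import re

MARKER = re.compile(r"\[(\d+)\]")  # inline chips such as [1], [2]


def build_citations(hits):
    """Number RAG hits from 1 so they line up with inline chips."""
    return [
        {"index": i, "page_uuid": h["page_uuid"], "block_id": h.get("block_id")}
        for i, h in enumerate(hits, start=1)
    ]


def answer_is_grounded(answer_text, citations):
    """True only if the answer cites something and every marker resolves."""
    known = {c["index"] for c in citations}
    used = {int(m) for m in MARKER.findall(answer_text)}
    return bool(used) and used <= known
```

An answer with no markers, or with a marker that points past the end of the citations array, would be rejected under this check.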

7 · Indexing queue

Trigger: every page/block/comment/file/template/db_row write fires an event; the listener queues IndexRagSourceJob keyed by (source_type, source_id).

Idempotency: the job computes content_hash; if unchanged, no-op.

Concurrency: per-workspace Cache::lock('workspace:{id}:rag-index') prevents thundering herds on bulk import. Per-source jobs do not block each other.

Backpressure: the rag-index queue runs at low priority; the Horizon supervisor limits it to 4 workers.

Stale detection: a nightly job compares rag_chunks against each source's updated_at and re-enqueues any mismatches.
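The idempotency and stale-detection rules can be sketched in a few lines. This is a simplified model, not the IndexRagSourceJob itself: it hashes the whole source text rather than individual chunks, uses an in-memory dict in place of rag_chunks, and takes the embedding function as a parameter. The names index_source and find_stale are hypothetical.

```python
import hashlib


def index_source(store, source_type, source_id, text, embed):
    """Index one source; return False when the content is unchanged.

    store: dict keyed by (source_type, source_id), standing in for the
    chunk table. embed: callable that turns text into a vector.
    """
    key = (source_type, source_id)
    new_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    existing = store.get(key)
    if existing is not None and existing["content_hash"] == new_hash:
        return False  # same content: no-op, no embedding call
    store[key] = {"content_hash": new_hash, "embedding": embed(text)}
    return True


def find_stale(indexed_at, updated_at):
    """Return source keys that changed after they were last indexed.

    Both arguments map (source_type, source_id) to a comparable timestamp.
    A source that was never indexed also counts as stale.
    """
    return sorted(
        key for key, changed in updated_at.items()
        if key not in indexed_at or indexed_at[key] < changed
    )
```

Because an unchanged hash returns early, re-queuing the same source repeatedly costs a hash computation but no embedding request.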

8 · Reindex endpoints

POST /rag/reindex (admin): { scope: "workspace|page|database", id }. GET /rag/status returns { chunks, stale, last_indexed_at, queue_depth }.

9 · Secret exclusion

Patterns scanned on chunk text before embedding: sk-[A-Za-z0-9]{20,}, ghp_[A-Za-z0-9]{20,}, AKIA[A-Z0-9]{16}, JWTs, xox[bp]-, and lines containing BEGIN RSA. A match drops the chunk and emits an activity_logs warning.

Files marked metadata.is_secret = true are never embedded.
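The pre-embedding scan can be sketched as a list of compiled patterns applied to each chunk. The first five patterns are taken from this section; the JWT pattern is an assumption, since the spec only says "JWTs", and the warn callback stands in for writing to activity_logs. Function names are hypothetical.

```python
import re

SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),
    re.compile(r"ghp_[A-Za-z0-9]{20,}"),
    re.compile(r"AKIA[A-Z0-9]{16}"),
    re.compile(r"xox[bp]-"),
    re.compile(r"BEGIN RSA"),
    # Assumption: three dot-separated base64url segments starting with "eyJ".
    re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
]


def contains_secret(text):
    """True if any secret pattern appears anywhere in the text."""
    return any(p.search(text) for p in SECRET_PATTERNS)


def drop_secret_chunks(chunks, warn):
    """Keep only clean chunks; report each dropped one through warn().

    warn: callable taking (source_type, source_id, chunk_index); a
    placeholder for the activity log write.
    """
    kept = []
    for chunk in chunks:
        if contains_secret(chunk["text"]):
            warn(chunk["source_type"], chunk["source_id"], chunk["chunk_index"])
            continue
        kept.append(chunk)
    return kept
```

The warning deliberately carries only identifiers, not the matched text, so the secret itself is not copied into the log.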

10 · Per-workspace ON/OFF

A settings row rag.enabled = true|false controls the feature. When it is disabled, RagService::query() returns 200 with { disabled: true, hits: [] } and the AI Q&A surface tells the user to enable it. Indexing also pauses.
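The short-circuit can be sketched as a guard in front of retrieval. The settings lookup, the treatment of a missing row as disabled, and the shape of the enabled response are assumptions for illustration; only the disabled payload comes from this section.

```python
def query(settings, workspace_id, run_retrieval):
    """Return the disabled payload without touching the index when RAG is off.

    settings: dict mapping workspace_id to that workspace's settings dict.
    run_retrieval: zero-argument callable that performs the real search.
    Assumption: a workspace with no rag.enabled row is treated as disabled.
    """
    enabled = settings.get(workspace_id, {}).get("rag.enabled", False)
    if not enabled:
        return {"disabled": True, "hits": []}
    return {"hits": run_retrieval()}
```

Passing retrieval in as a callable makes it easy to confirm that a disabled workspace never triggers a search.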

11 · pgvector opt-in (advanced)

When rag_settings.vector_backend = "pgvector" is set on a workspace, a sidecar Postgres schema is provisioned with a table rag_vectors (rag_chunk_id bigint primary key, embedding vector(3072)). The MySQL embedding JSON column is kept for replay and migration; reads use the Postgres ANN index. The default backend stays MySQL JSON: pgvector is a per-workspace opt-in, never the system default.

12 · Acceptance criteria

Toggling rag.enabled ON for a workspace queues a full reindex; OFF stops indexing and short-circuits queries.

A page edit updates rag_chunks within 60 s on default Horizon settings.

RagService::query() never returns a hit the caller cannot read.

Secret patterns in source text are dropped before embedding and logged.

GET /rag/status reports accurate chunks, stale, last_indexed_at, queue_depth.