SPEC-STV-07-RAG-System
SPEC-STV-07 · Spec header. Spec ID: SPEC-STV-07 · Title: RAG Knowledge Base · Version: 1.0.0 · Status: Planned · Authority: Specification · Priority: P0 · Owner role: RAG engineer · Reviewers: Backend architect, Security lead · Last reviewed: 2026-05-11 · Sync targets: app/Services/Rag/**, docs/RAG_SYSTEM.md · Depends on: SPEC-STV-HUB, SPEC-STV-02 · Consumed by: SPEC-STV-08, SPEC-STV-09 · Conflict rule: Hub wins. · Change policy: RAG engineer + Backend architect; Registry bump on backend/embed-model change.
1 · Goal
A workspace-scoped retrieval system that lets AI answer with citations to the workspace's own pages, blocks, comments, files, templates, and database rows. ON/OFF per workspace.
2 · Sources indexed
| source_type | What is chunked | Excluded |
|---|---|---|
| page | Title + concatenated visible text from all blocks (excluding archived). | Private pages from other users (respect permissions at query time). |
| block | Block-level chunks for long pages (chunked individually so citations point at a specific block). | Code blocks where `metadata.secret = true`. |
| comment | Comment body. | Resolved comments older than 30 days (configurable). |
| file | Extracted text from PDFs / Markdown / plain text. Images get an OCR pass (P5+). | Files > 50 MiB. Encrypted files. |
| template | Template name + description + flattened payload text. | None. |
| db_row | Concatenation of denormalized `value_text` per row. | Rows in archived databases. |
3 · Chunking
- Strategy: heading-aware splitter; max 1200 tokens / chunk, 150 token overlap.
- Boundary preference: paragraph > sentence > token.
- Each chunk stores `{ workspace_id, source_type, source_id, chunk_index, text, embedding JSON, content_hash sha256(text) }`.
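The boundary preference and overlap rules above can be sketched as follows. This is a minimal illustration, not the spec's splitter: it assumes whitespace-separated words stand in for model tokens, it skips heading awareness entirely, and the helper names (`count_tokens`, `split_units`, `chunk_text`) are hypothetical.

```python
import re

def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate model tokens.
    return len(text.split())

def split_units(text: str, max_tokens: int) -> list:
    """Break text into units of at most max_tokens, preferring paragraph
    boundaries, then sentence boundaries, then a hard token split."""
    units = []
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        if count_tokens(para) <= max_tokens:
            units.append(para)
            continue
        for sentence in re.split(r"(?<=[.!?])\s+", para):
            words = sentence.split()
            # Last resort: cut an oversized sentence on the token count.
            for i in range(0, len(words), max_tokens):
                units.append(" ".join(words[i:i + max_tokens]))
    return units

def chunk_text(text: str, max_tokens: int = 1200, overlap: int = 150) -> list:
    """Pack units into chunks of at most max_tokens; each new chunk starts
    with the last `overlap` tokens of the previous one."""
    chunks = []
    current = []  # tokens of the chunk being built
    # Units are capped at max_tokens - overlap so overlap + unit always fits.
    for unit in split_units(text, max_tokens - overlap):
        words = unit.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Note that this sketch joins tokens with single spaces, so paragraph breaks inside a chunk are not preserved; a real splitter would keep the original text offsets.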
4 · Embedding model
- Default: `text-embedding-3-large` (3072 dims). Configurable in `rag_settings.embedding_model`.
- Provider: same OpenAI key surface as AI; a workspace may flip to a self-hosted provider (`localai`, `ollama`) via `rag_settings.provider`.
- Re-embedding after a model change requires a full reindex job (admin-triggered).
5 · Retrieval
- Top-K = 8 default, capped at 20.
- Always filter by `workspace_id` AND by the caller's effective page permissions.
- Re-rank: cosine similarity → optional cross-encoder (post-v1) for top-50 → final top-K.
- Hits are returned with `{ page_uuid, block_id?, score, excerpt }`.
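The retrieval rules above (workspace and permission filter, cosine ranking, Top-K capped at 20) could look roughly like this in memory. It is a sketch under assumptions: the `Chunk` type, the `readable_pages` set, the 200-character excerpt, and the function names are hypothetical, and the optional cross-encoder stage is left out.

```python
import math
from dataclasses import dataclass
from typing import Optional

DEFAULT_TOP_K = 8
MAX_TOP_K = 20

@dataclass
class Chunk:
    workspace_id: int
    page_uuid: str
    block_id: Optional[str]
    text: str
    embedding: list

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def query(chunks, query_embedding, workspace_id, readable_pages,
          top_k=DEFAULT_TOP_K):
    # Clamp the requested K to the cap of 20.
    k = max(1, min(top_k, MAX_TOP_K))
    hits = []
    for chunk in chunks:
        # Filter BEFORE scoring: wrong workspace or unreadable page is skipped.
        if chunk.workspace_id != workspace_id:
            continue
        if chunk.page_uuid not in readable_pages:
            continue
        hits.append({
            "page_uuid": chunk.page_uuid,
            "block_id": chunk.block_id,
            "score": cosine(query_embedding, chunk.embedding),
            "excerpt": chunk.text[:200],  # assumption: excerpt length
        })
    hits.sort(key=lambda h: h["score"], reverse=True)
    return hits[:k]
```

Filtering before ranking means an unreadable chunk can never occupy a Top-K slot, which is one way to keep the permission guarantee independent of the scoring step.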
6 · Citations format (AI side)
When the AI uses RAG hits, the response includes a citations array. The web client renders inline chips [1], [2], etc., resolving to the source page/block. No citation = no claim allowed (AiAnswerService enforces).
7 · Indexing queue
- Trigger: every page/block/comment/file/template/db_row write fires an event; the listener queues `IndexRagSourceJob` keyed by `(source_type, source_id)`.
- Idempotency: the job computes `content_hash`; if unchanged, no-op.
- Concurrency: per-workspace `Cache::lock('workspace:{id}:rag-index')` prevents thundering herds on bulk import. Per-source jobs do not block each other.
- Backpressure: queue `rag-index` runs at low priority; the Horizon supervisor caps it at 4 workers.
- Stale detection: a nightly job compares `rag_chunks` against the source `updated_at`; mismatches are re-enqueued.
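The idempotency rule can be illustrated with a small in-memory sketch. It assumes one stored record per `(source_type, source_id)` key for brevity, whereas the spec stores a hash per chunk; the `store` dict, the `embed` callable, and `index_source` are hypothetical names.

```python
import hashlib

def content_hash(text: str) -> str:
    # sha256 over the UTF-8 bytes of the text, hex-encoded.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def index_source(store: dict, key: tuple, text: str, embed) -> bool:
    """Re-embed only when the content hash changed.

    Returns True when the source was (re)indexed, False on a no-op.
    """
    new_hash = content_hash(text)
    existing = store.get(key)
    if existing is not None and existing["content_hash"] == new_hash:
        return False  # unchanged content: skip the embedding call
    store[key] = {
        "text": text,
        "content_hash": new_hash,
        "embedding": embed(text),
    }
    return True
```

Because the hash check comes before the embedding call, replaying the same write event costs one hash computation and no provider request.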
8 · Reindex endpoints
POST /rag/reindex (admin): { scope: "workspace|page|database", id }. GET /rag/status returns { chunks, stale, last_indexed_at, queue_depth }.
9 · Secret exclusion
- Patterns scanned on chunk text before embedding: `sk-[A-Za-z0-9]{20,}`, `ghp_[A-Za-z0-9]{20,}`, `AKIA[A-Z0-9]{16}`, JWTs, `xox[bp]-`, lines containing `BEGIN RSA`. Matches → drop the chunk and emit an `activity_logs` warning.
- Files marked `metadata.is_secret = true` are never embedded.
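A pre-embedding scan along these lines might look like the sketch below. The first five patterns are taken from the list above; the JWT pattern is an assumption (the spec only says "JWTs"), and `filter_chunks` is a hypothetical helper whose `dropped` list stands in for the `activity_logs` warning.

```python
import re

SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),
    re.compile(r"ghp_[A-Za-z0-9]{20,}"),
    re.compile(r"AKIA[A-Z0-9]{16}"),
    re.compile(r"xox[bp]-"),
    re.compile(r"BEGIN RSA"),
    # Assumption: a JWT is matched as three dot-separated base64url
    # segments whose first segment starts with "eyJ".
    re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
]

def filter_chunks(chunks):
    """Split chunks into (kept, dropped); a chunk is dropped when any
    secret pattern matches anywhere in its text."""
    kept, dropped = [], []
    for chunk in chunks:
        if any(pattern.search(chunk) for pattern in SECRET_PATTERNS):
            dropped.append(chunk)  # caller would log a warning here
        else:
            kept.append(chunk)
    return kept, dropped
```

Dropping the whole chunk, rather than redacting the match, follows the rule above and avoids embedding text that surrounds a credential.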
10 · Per-workspace ON/OFF
Settings row `rag.enabled = true|false`. When disabled, `RagService::query()` returns 200 with `{ disabled: true, hits: [] }` and the AI Q&A surface tells the user to enable it. Indexing also pauses.
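The disabled short-circuit can be sketched as a guard in front of the search call. This is illustrative only: `rag_query`, the `settings` dict, and the `search` callable are hypothetical, and the shape of the enabled response and the default for a missing setting are assumptions.

```python
def rag_query(settings: dict, question: str, search) -> dict:
    # Assumption: a missing setting is treated as disabled.
    if not settings.get("rag.enabled", False):
        # Short-circuit: no search is performed for a disabled workspace.
        return {"disabled": True, "hits": []}
    return {"disabled": False, "hits": search(question)}
```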
11 · pgvector opt-in (advanced)
When rag_settings.vector_backend = "pgvector" is set on a workspace, a sidecar Postgres schema is provisioned with a table rag_vectors (rag_chunk_id bigint primary key, embedding vector(3072)). The MySQL embedding JSON column is kept for replay and migration; reads use the Postgres ANN index. The default backend stays MySQL JSON — pgvector is per-workspace opt-in, never the system default.
12 · Acceptance criteria
- Toggling `rag.enabled` ON for a workspace queues a full reindex; OFF stops indexing and short-circuits queries.
- A page edit updates `rag_chunks` within 60 s on default Horizon settings.
- `RagService::query()` never returns a hit the caller cannot read.
- Secret patterns in source text are dropped before embedding and logged.
- `GET /rag/status` reports accurate `chunks`, `stale`, `last_indexed_at`, `queue_depth`.