RAG (Retrieval-Augmented Generation)

Question-answering system using vector embeddings and hybrid search.

Glossary

RAG

Retrieval-Augmented Generation. Technique that enhances LLM responses by first retrieving relevant documents from a knowledge base, then using them as context for answer generation. Reduces hallucinations and grounds answers in actual content.

Embedding

A vector (list of numbers) representing text in a high-dimensional space. Similar texts have similar vectors, enabling semantic search. Our embeddings have 1024 dimensions.

Vector Search / kNN

k-Nearest Neighbors search. Finds documents whose embedding vectors are closest to the query vector using cosine similarity. Captures semantic meaning beyond keyword matching.

BM25

Best Match 25. Traditional text search algorithm that ranks documents by term frequency and inverse document frequency. Good for exact keyword matches.

Hybrid Search

Combines vector search (semantic) with BM25 (keyword) for better results. Vector search finds conceptually similar content; BM25 finds exact term matches.

Chunk

A segment of text from a larger document. Long documents are split into overlapping chunks (default: 1000 chars with 200 char overlap) so each chunk fits in the embedding model’s context and retrieval is more precise.

LLM

Large Language Model. AI model trained on text that can generate human-like responses. Used here to formulate answers based on retrieved context (Ollama/Mistral).

Ollama

Local LLM runtime for running open-source models. Used for development with models like llama3.1:8b (chat) and mxbai-embed-large (embeddings).

Cosine Similarity

Measure of how closely two vectors point in the same direction. A value of 1 means identical direction, 0 means orthogonal. Used to find semantically similar text chunks (see the sketch at the end of this glossary).

Dense Vector

Elasticsearch field type for storing embedding vectors. Enables efficient approximate nearest neighbor (ANN) search at scale.

Context Window

Maximum amount of text an LLM can process at once. Retrieved chunks are concatenated as context for the LLM, limited by RAG_MAX_CONTEXT_LENGTH.

Top-K

Number of most relevant results to return from search. Higher values provide more context but may include less relevant content.

num_candidates

Elasticsearch kNN parameter controlling how many candidates to consider before selecting top-K. Higher values improve accuracy but slow down search.
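
As a companion to the Cosine Similarity entry above, the sketch below shows how the measure is computed. It is illustrative only; Elasticsearch performs this computation internally during kNN search.

// Reference sketch: cosine similarity between two embedding vectors
function cosineSimilarity(a, b) {
    let dot = 0, normA = 0, normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];    // dot product
        normA += a[i] * a[i];  // squared length of a
        normB += b[i] * b[i];  // squared length of b
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// The 1024-dimensional embeddings are compared the same way as this toy example
cosineSimilarity([1, 0, 1], [1, 0, 0]);  // ~0.707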

Architecture

+------------------+     +------------------+     +------------------+
|   Plone CMS      |     |   Redis/RQ       |     |  Elasticsearch   |
|                  |     |   Worker         |     |                  |
|  - REST API      |---->|  - Embedding     |---->|  - Chunks Index  |
|  - Subscribers   |     |    Generation    |     |  - kNN + BM25    |
+------------------+     |  - RAG Ask       |     +------------------+
                         +------------------+
                                  |
                                  v
                         +------------------+
                         |  Ollama/Mistral  |
                         |  - Embeddings    |
                         |  - LLM Chat      |
                         +------------------+

Components

REST API Endpoints

  • @rag-ask (POST) - Ask a question (async by default; sync with "sync": true for logged-in users)

  • @rag-ask (GET) - Poll for an async result by job_id

  • @rag-status (GET) - RAG system status and statistics

  • @rag-index (POST) - Manually trigger embedding for a single content object

  • @rag-index-all (POST) - Reindex all RAG-enabled content

Modules

  • config.py - Environment-based configuration and registry access

  • client.py - OpenAI-compatible API client for embeddings and LLM

  • chunks.py - Elasticsearch chunks index management

  • index.py - Content embedding and hybrid search interface

  • tasks.py - Redis/RQ background tasks for async embedding and question-answering

  • subscribers.py - Zope event handlers for automatic indexing

Configuration

Environment Variables

# Feature flag
RAG_ENABLED=true

# Embedding API (Ollama local / Mistral production)
EMBEDDING_BASE_URL=http://localhost:11434/v1
EMBEDDING_API_KEY=ollama
EMBEDDING_MODEL=mxbai-embed-large
EMBEDDING_DIMENSIONS=1024

# LLM API (Ollama local / Mistral production)
LLM_BASE_URL=http://localhost:11434/v1
LLM_API_KEY=ollama
LLM_MODEL=llama3.1:8b

# Chunking
RAG_CHUNK_SIZE=1000
RAG_CHUNK_OVERLAP=200

# Search
RAG_TOP_K=5
RAG_NUM_CANDIDATES=100
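
To illustrate how RAG_CHUNK_SIZE and RAG_CHUNK_OVERLAP interact, the splitter below is a minimal character-based sketch; the actual splitter used by the worker may differ (for example by respecting word or sentence boundaries).

// Minimal chunking sketch using the defaults above (1000 chars, 200 overlap)
function splitIntoChunks(text, chunkSize = 1000, overlap = 200) {
    const chunks = [];
    const step = chunkSize - overlap;  // each chunk starts 800 characters after the previous one
    for (let start = 0; start < text.length; start += step) {
        chunks.push(text.slice(start, start + chunkSize));
        if (start + chunkSize >= text.length) break;  // the last chunk reached the end of the text
    }
    return chunks;
}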

Plone Registry

Content types and LLM behavior are configured via the Plone registry:

  • wcs.backend.rag_content_types - Content types to index (default: ContentPage, News, File, Contact, Book, Chapter)

  • wcs.backend.rag_system_prompt - System prompt for the LLM

  • wcs.backend.rag_no_answer_message - Message when no relevant context is found

  • wcs.backend.rag_error_message - Message when answer generation fails

  • wcs.backend.rag_llm_temperature - LLM temperature (0.0 = deterministic, 1.0 = creative)

  • wcs.backend.rag_llm_max_tokens - Maximum tokens in LLM response

  • wcs.backend.rag_boost_bm25 - BM25 text search boost factor

  • wcs.backend.rag_boost_knn - kNN vector search boost factor

  • wcs.backend.rag_title_boost_factor - Additional boost for title matches

  • wcs.backend.rag_score_high_threshold - Minimum score for high confidence results

  • wcs.backend.rag_score_medium_threshold - Minimum score for medium confidence results

Elasticsearch Index

Chunks are stored in a separate index, {plone-index}-rag-chunks, with the following mapping:

{
  "properties": {
    "chunk_id": {"type": "keyword"},
    "parent_uid": {"type": "keyword"},
    "parent_title": {"type": "text"},
    "parent_path": {"type": "keyword"},
    "portal_type": {"type": "keyword"},
    "allowedRolesAndUsers": {"type": "keyword"},
    "chunk_index": {"type": "integer"},
    "chunk_text": {"type": "text"},
    "embedding": {
      "type": "dense_vector",
      "dims": 1024,
      "index": true,
      "similarity": "cosine"
    }
  }
}
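
Hybrid search against this index combines a BM25 match on chunk_text with kNN on embedding. The request body below is only a sketch of that shape; the actual query is built by index.py and differs in details such as permission (allowedRolesAndUsers) and path filters.

// Illustrative hybrid search body; the BM25 and kNN scores are combined using the boosts.
const questionEmbedding = [/* 1024-dimensional vector of the question */];
const hybridSearchBody = {
    size: 5,  // RAG_TOP_K
    query: {
        match: {
            chunk_text: { query: 'What are the opening hours?', boost: 1.0 }  // rag_boost_bm25
        }
    },
    knn: {
        field: 'embedding',
        query_vector: questionEmbedding,
        k: 5,                 // RAG_TOP_K
        num_candidates: 100,  // RAG_NUM_CANDIDATES
        boost: 1.0            // rag_boost_knn
    }
};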

Data Flow

Indexing (Async via Worker)

  1. Content creation or modification triggers on_content_added/on_content_modified

  2. queue_embedding_task queues a job with the ES settings and index name

  3. Worker fetches content via fetch_data (REST API)

  4. Text split into overlapping chunks

  5. Embeddings generated via OpenAI-compatible API (see the example after this list)

  6. Chunks indexed to Elasticsearch
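
The embeddings in step 5 are requested from the OpenAI-compatible endpoint configured via EMBEDDING_BASE_URL (handled by client.py in the worker). The call below shows the request shape with the local Ollama defaults; the values are illustrative.

// Illustrative request to the OpenAI-compatible embeddings endpoint (local Ollama defaults)
const response = await fetch('http://localhost:11434/v1/embeddings', {
    method: 'POST',
    headers: {
        'Content-Type': 'application/json',
        'Authorization': 'Bearer ollama'  // EMBEDDING_API_KEY
    },
    body: JSON.stringify({
        model: 'mxbai-embed-large',  // EMBEDDING_MODEL
        input: 'Opening hours: Monday to Friday...'
    })
});
const data = await response.json();
// data.data[0].embedding is a 1024-dimensional vector (EMBEDDING_DIMENSIONS)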

Question-Answering (Async via Worker)

  1. @rag-ask (POST) receives question from client

  2. Job ID generated from question hash + user security context + optional path

  3. If cached result exists, return immediately

  4. Otherwise, queue rag_ask_task to Redis worker

  5. Return pending status with job_id for polling

  6. Worker performs hybrid search and generates LLM answer

  7. Result cached in Redis (TTL: 5 minutes)

  8. Client polls @rag-ask (GET) with job_id until completed

Local Development

Start Ollama:

ollama serve
ollama pull mxbai-embed-large
ollama pull llama3.1:8b

Enable RAG:

export RAG_ENABLED=true

REST API

Ask a Question (Async)

The @rag-ask endpoint processes questions asynchronously by default, allowing the frontend to poll for results without blocking. Results are cached based on the question, user permissions, and optional path filter.

// Submit question (async)
const response = await fetch('/Plone/@rag-ask', {
    method: 'POST',
    headers: {
        'Accept': 'application/json',
        'Content-Type': 'application/json'
    },
    body: JSON.stringify({
        question: 'What are the opening hours?',
        path: '/plone/section'  // optional: filter to specific section
    })
});
const data = await response.json();
// data.status === 'pending', data.job_id === 'ragask_abc123'

Pending response:

{
  "@id": "http://localhost:8080/Plone/@rag-ask",
  "status": "pending",
  "job_id": "ragask_abc123",
  "question": "What are the opening hours?"
}

Poll for result:

// Poll until completed
const pollResponse = await fetch('/Plone/@rag-ask?job_id=ragask_abc123', {
    headers: { 'Accept': 'application/json' }
});
const result = await pollResponse.json();

Completed response:

{
  "@id": "http://localhost:8080/Plone/@rag-ask",
  "status": "completed",
  "question": "What are the opening hours?",
  "answer": "Based on the information...",
  "sources": [
    {
      "title": "Contact",
      "path": "/contact",
      "portal_type": "Contact",
      "score": 0.92,
      "chunk_index": 0,
      "snippet": "Opening hours: Monday to Friday..."
    }
  ]
}
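
Clients typically repeat the GET request until the status is no longer pending. A minimal polling loop might look like this (the one-second interval and attempt limit are illustrative choices):

// Poll @rag-ask until the job leaves the pending state
async function pollRagAnswer(jobId, maxAttempts = 30) {
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        const response = await fetch(`/Plone/@rag-ask?job_id=${jobId}`, {
            headers: { 'Accept': 'application/json' }
        });
        const result = await response.json();
        if (result.status !== 'pending') {
            return result;  // completed result with answer and sources
        }
        await new Promise((resolve) => setTimeout(resolve, 1000));  // wait before retrying
    }
    throw new Error(`No result for job ${jobId} after ${maxAttempts} attempts`);
}

const answer = await pollRagAnswer('ragask_abc123');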

Ask a Question (Sync)

Logged-in users can request synchronous processing by adding "sync": true. This bypasses the queue and returns the answer directly.

const response = await fetch('/Plone/@rag-ask', {
    method: 'POST',
    headers: {
        'Accept': 'application/json',
        'Content-Type': 'application/json'
    },
    body: JSON.stringify({
        question: 'What are the opening hours?',
        sync: true
    })
});
const data = await response.json();

Sync response:

{
  "@id": "http://localhost:8080/Plone/@rag-ask",
  "question": "What are the opening hours?",
  "answer": "Based on the information...",
  "sources": [
    {"title": "Contact", "path": "/contact", "score": 0.92, "chunk_index": 0}
  ]
}

Path Filtering

Use the path parameter to restrict search results to a specific section of the site:

const response = await fetch('/Plone/@rag-ask', {
    method: 'POST',
    headers: {
        'Accept': 'application/json',
        'Content-Type': 'application/json'
    },
    body: JSON.stringify({
        question: 'How do I configure this?',
        path: '/plone/documentation/admin-guide'
    })
});

This filters results to only include content under the specified path prefix.

Reindex All Content

const response = await fetch('/Plone/@rag-index-all', {
    method: 'POST',
    headers: {
        'Accept': 'application/json',
        'Content-Type': 'application/json'
    }
});
const data = await response.json();

Response:

{
  "@id": "http://localhost:8080/Plone/@rag-index-all",
  "status": "queued",
  "queued_count": 150,
  "content_types": ["ContentPage", "News", "File", "Contact", "Book", "Chapter"]
}

Index Single Content

const response = await fetch('/Plone/@rag-index', {
    method: 'POST',
    headers: {
        'Accept': 'application/json',
        'Content-Type': 'application/json'
    },
    body: JSON.stringify({
        uid: 'content-uid-here',
        async: true  // default: true
    })
});
const data = await response.json();

Response:

{
  "@id": "http://localhost:8080/Plone/@rag-index",
  "status": "queued",
  "uid": "content-uid-here"
}

Check Status

const response = await fetch('/Plone/@rag-status', {
    headers: { 'Accept': 'application/json' }
});
const data = await response.json();

Response:

{
  "@id": "http://localhost:8080/Plone/@rag-status",
  "enabled": true,
  "index_name": "plone-rag-chunks",
  "exists": true,
  "chunk_count": 1250,
  "parent_count": 150,
  "dimensions": 1024
}