RAG (Retrieval-Augmented Generation)

Question-answering system using vector embeddings and hybrid search.

Glossary

RAG

Retrieval-Augmented Generation. Technique that enhances LLM responses by first retrieving relevant documents from a knowledge base, then using them as context for answer generation. Reduces hallucinations and grounds answers in actual content.

Embedding

A vector (list of numbers) representing text in a high-dimensional space. Similar texts have similar vectors, enabling semantic search. Our embeddings have 1024 dimensions.

Vector Search / kNN

k-Nearest Neighbors search. Finds documents whose embedding vectors are closest to the query vector using cosine similarity. Captures semantic meaning beyond keyword matching.

BM25

Best Match 25. Traditional text search algorithm that ranks documents by term frequency and inverse document frequency. Good for exact keyword matches.

Hybrid Search

Combines vector search (semantic) with BM25 (keyword) for better results. Vector search finds conceptually similar content; BM25 finds exact term matches.

Chunk

A segment of text from a larger document. Long documents are split into overlapping chunks (default: 1000 chars with 200 char overlap) so each chunk fits in the embedding model’s context and retrieval is more precise.

LLM

Large Language Model. AI model trained on text that can generate human-like responses. Used here to formulate answers based on retrieved context (Ollama/Mistral).

Ollama

Local LLM runtime for running open-source models. Used for development with models like llama3.1:8b (chat) and mxbai-embed-large (embeddings).

Cosine Similarity

Measure of how closely two vectors point in the same direction. A value of 1 means identical direction, 0 means orthogonal. Used to find semantically similar text chunks (see the sketch at the end of this glossary).

Dense Vector

Elasticsearch field type for storing embedding vectors. Enables efficient approximate nearest neighbor (ANN) search at scale.

Context Window

Maximum amount of text an LLM can process at once. Retrieved chunks are concatenated as context for the LLM, limited by RAG_MAX_CONTEXT_LENGTH.

Top-K

Number of most relevant results to return from search. Higher values provide more context but may include less relevant content.

num_candidates

Elasticsearch kNN parameter controlling how many candidates to consider before selecting top-K. Higher values improve accuracy but slow down search.
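
As a companion to the Cosine Similarity entry above, the sketch below shows how the measure is computed. It is illustrative only; Elasticsearch performs this computation internally during kNN search.

// Reference sketch: cosine similarity between two embedding vectors
function cosineSimilarity(a, b) {
    let dot = 0, normA = 0, normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];    // dot product
        normA += a[i] * a[i];  // squared length of a
        normB += b[i] * b[i];  // squared length of b
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// The 1024-dimensional embeddings are compared the same way as this toy example
cosineSimilarity([1, 0, 1], [1, 0, 0]);  // ~0.707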

Architecture

+------------------+     +------------------+     +------------------+
|   Plone CMS      |     |   Redis/RQ       |     |  Elasticsearch   |
|                  |     |   Worker         |     |                  |
|  - REST API      |---->|  - Embedding     |---->|  - Chunks Index  |
|  - Subscribers   |     |    Generation    |     |  - kNN + BM25    |
+------------------+     |  - RAG Ask       |     +------------------+
                         +------------------+
                                  |
                                  v
                         +------------------+
                         |  Ollama/Mistral  |
                         |  - Embeddings    |
                         |  - LLM Chat      |
                         +------------------+

Components

REST API Endpoints

  • @rag-ask (POST) - Ask a question (async by default; sync with "sync": true for logged-in users)

  • @rag-ask (GET) - Poll for an async result by job_id

  • @rag-status (GET) - RAG system status and statistics

  • @rag-index (POST) - Manually trigger embedding for a single content object

  • @rag-index-all (POST) - Reindex all RAG-enabled content

Modules

  • config.py - Environment-based configuration and registry access

  • client.py - OpenAI-compatible API client for embeddings and LLM

  • chunks.py - Elasticsearch chunks index management

  • index.py - Content embedding and hybrid search interface

  • tasks.py - Redis/RQ background tasks for async embedding and question-answering

  • subscribers.py - Zope event handlers for automatic indexing

Configuration

Environment Variables

# Feature flag
RAG_ENABLED=true

# Embedding API (Ollama local / Mistral production)
EMBEDDING_BASE_URL=http://localhost:11434/v1
EMBEDDING_API_KEY=ollama
EMBEDDING_MODEL=mxbai-embed-large
EMBEDDING_DIMENSIONS=1024

# LLM API (Ollama local / Mistral production)
LLM_BASE_URL=http://localhost:11434/v1
LLM_API_KEY=ollama
LLM_MODEL=llama3.1:8b

# Chunking
RAG_CHUNK_SIZE=1000
RAG_CHUNK_OVERLAP=200

# Search
RAG_TOP_K=5
RAG_NUM_CANDIDATES=100
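
To illustrate how RAG_CHUNK_SIZE and RAG_CHUNK_OVERLAP interact, the splitter below is a minimal character-based sketch; the actual splitter used by the worker may differ (for example by respecting word or sentence boundaries).

// Minimal chunking sketch using the defaults above (1000 chars, 200 overlap)
function splitIntoChunks(text, chunkSize = 1000, overlap = 200) {
    const chunks = [];
    const step = chunkSize - overlap;  // each chunk starts 800 characters after the previous one
    for (let start = 0; start < text.length; start += step) {
        chunks.push(text.slice(start, start + chunkSize));
        if (start + chunkSize >= text.length) break;  // the last chunk reached the end of the text
    }
    return chunks;
}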

Plone Registry

Content types and LLM behavior are configured via the Plone registry:

  • wcs.backend.rag_content_types - Content types to index (default: ContentPage, News, File, Contact, Book, Chapter)

  • wcs.backend.rag_system_prompt - System prompt for the LLM

  • wcs.backend.rag_no_answer_message - Message when no relevant context is found

  • wcs.backend.rag_error_message - Message when answer generation fails

  • wcs.backend.rag_llm_temperature - LLM temperature (0.0 = deterministic, 1.0 = creative)

  • wcs.backend.rag_llm_max_tokens - Maximum tokens in LLM response

  • wcs.backend.rag_boost_bm25 - BM25 text search boost factor

  • wcs.backend.rag_boost_knn - kNN vector search boost factor

  • wcs.backend.rag_title_boost_factor - Additional boost for title matches

  • wcs.backend.rag_score_high_threshold - Minimum score for high confidence results

  • wcs.backend.rag_score_medium_threshold - Minimum score for medium confidence results

Elasticsearch Index

Chunks are stored in a separate index, {plone-index}-rag-chunks, with the following mapping:

{
  "properties": {
    "chunk_id": {"type": "keyword"},
    "parent_uid": {"type": "keyword"},
    "parent_title": {"type": "text"},
    "parent_path": {"type": "keyword"},
    "portal_type": {"type": "keyword"},
    "allowedRolesAndUsers": {"type": "keyword"},
    "chunk_index": {"type": "integer"},
    "chunk_text": {"type": "text"},
    "embedding": {
      "type": "dense_vector",
      "dims": 1024,
      "index": true,
      "similarity": "cosine"
    }
  }
}
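
Hybrid search against this index combines a BM25 match on chunk_text with kNN on embedding. The request body below is only a sketch of that shape; the actual query is built by index.py and differs in details such as permission (allowedRolesAndUsers) and path filters.

// Illustrative hybrid search body; the BM25 and kNN scores are combined using the boosts.
const questionEmbedding = [/* 1024-dimensional vector of the question */];
const hybridSearchBody = {
    size: 5,  // RAG_TOP_K
    query: {
        match: {
            chunk_text: { query: 'What are the opening hours?', boost: 1.0 }  // rag_boost_bm25
        }
    },
    knn: {
        field: 'embedding',
        query_vector: questionEmbedding,
        k: 5,                 // RAG_TOP_K
        num_candidates: 100,  // RAG_NUM_CANDIDATES
        boost: 1.0            // rag_boost_knn
    }
};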

Data Flow

Indexing (Async via Worker)

  1. Content creation or modification triggers on_content_added/on_content_modified

  2. queue_embedding_task queues a job with the ES settings and index name

  3. Worker fetches content via fetch_data (REST API)

  4. Text split into overlapping chunks

  5. Embeddings generated via OpenAI-compatible API (see the example after this list)

  6. Chunks indexed to Elasticsearch
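
The embeddings in step 5 are requested from the OpenAI-compatible endpoint configured via EMBEDDING_BASE_URL (handled by client.py in the worker). The call below shows the request shape with the local Ollama defaults; the values are illustrative.

// Illustrative request to the OpenAI-compatible embeddings endpoint (local Ollama defaults)
const response = await fetch('http://localhost:11434/v1/embeddings', {
    method: 'POST',
    headers: {
        'Content-Type': 'application/json',
        'Authorization': 'Bearer ollama'  // EMBEDDING_API_KEY
    },
    body: JSON.stringify({
        model: 'mxbai-embed-large',  // EMBEDDING_MODEL
        input: 'Opening hours: Monday to Friday...'
    })
});
const data = await response.json();
// data.data[0].embedding is a 1024-dimensional vector (EMBEDDING_DIMENSIONS)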

Question-Answering (Async via Worker)

  1. @rag-ask (POST) receives question from client

  2. Job ID generated from question hash + user security context + optional path

  3. If cached result exists, return immediately

  4. Otherwise, queue rag_ask_task to Redis worker

  5. Return pending status with job_id for polling

  6. Worker performs hybrid search and generates LLM answer

  7. Result cached in Redis (TTL: 5 minutes)

  8. Client polls @rag-ask (GET) with job_id until completed

Local Development

Start Ollama:

ollama serve
ollama pull mxbai-embed-large
ollama pull llama3.1:8b

Enable RAG:

export RAG_ENABLED=true

REST API

Ask a Question (Async)

The @rag-ask endpoint processes questions asynchronously by default, allowing the frontend to poll for results without blocking. Results are cached based on the question, user permissions, and optional path filter.

// Submit question (async)
const response = await fetch('/Plone/@rag-ask', {
    method: 'POST',
    headers: {
        'Accept': 'application/json',
        'Content-Type': 'application/json'
    },
    body: JSON.stringify({
        question: 'What are the opening hours?',
        path: '/plone/section'  // optional: filter to specific section
    })
});
const data = await response.json();
// data.status === 'pending', data.job_id === 'ragask_abc123'

Pending response:

{
  "@id": "http://localhost:8080/Plone/@rag-ask",
  "status": "pending",
  "job_id": "ragask_abc123",
  "question": "What are the opening hours?"
}

Poll for result:

// Poll until completed
const pollResponse = await fetch('/Plone/@rag-ask?job_id=ragask_abc123', {
    headers: { 'Accept': 'application/json' }
});
const result = await pollResponse.json();

Completed response:

{
  "@id": "http://localhost:8080/Plone/@rag-ask",
  "status": "completed",
  "question": "What are the opening hours?",
  "answer": "Based on the information...",
  "sources": [
    {
      "title": "Contact",
      "path": "/contact",
      "portal_type": "Contact",
      "score": 0.92,
      "chunk_index": 0,
      "snippet": "Opening hours: Monday to Friday..."
    }
  ]
}
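
Clients typically repeat the GET request until the status is no longer pending. A minimal polling loop might look like this (the one-second interval and attempt limit are illustrative choices):

// Poll @rag-ask until the job leaves the pending state
async function pollRagAnswer(jobId, maxAttempts = 30) {
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        const response = await fetch(`/Plone/@rag-ask?job_id=${jobId}`, {
            headers: { 'Accept': 'application/json' }
        });
        const result = await response.json();
        if (result.status !== 'pending') {
            return result;  // completed result with answer and sources
        }
        await new Promise((resolve) => setTimeout(resolve, 1000));  // wait before retrying
    }
    throw new Error(`No result for job ${jobId} after ${maxAttempts} attempts`);
}

const answer = await pollRagAnswer('ragask_abc123');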

Ask a Question (Sync)

Logged-in users can request synchronous processing by adding "sync": true. This bypasses the queue and returns the answer directly.

const response = await fetch('/Plone/@rag-ask', {
    method: 'POST',
    headers: {
        'Accept': 'application/json',
        'Content-Type': 'application/json'
    },
    body: JSON.stringify({
        question: 'What are the opening hours?',
        sync: true
    })
});
const data = await response.json();

Sync response:

{
  "@id": "http://localhost:8080/Plone/@rag-ask",
  "question": "What are the opening hours?",
  "answer": "Based on the information...",
  "sources": [
    {"title": "Contact", "path": "/contact", "score": 0.92, "chunk_index": 0}
  ]
}

Path Filtering

Use the path parameter to restrict search results to a specific section of the site:

const response = await fetch('/Plone/@rag-ask', {
    method: 'POST',
    headers: {
        'Accept': 'application/json',
        'Content-Type': 'application/json'
    },
    body: JSON.stringify({
        question: 'How do I configure this?',
        path: '/plone/documentation/admin-guide'
    })
});

This filters results to only include content under the specified path prefix.

Reindex All Content

const response = await fetch('/Plone/@rag-index-all', {
    method: 'POST',
    headers: {
        'Accept': 'application/json',
        'Content-Type': 'application/json'
    }
});
const data = await response.json();

Response:

{
  "@id": "http://localhost:8080/Plone/@rag-index-all",
  "status": "queued",
  "queued_count": 150,
  "content_types": ["ContentPage", "News", "File", "Contact", "Book", "Chapter"]
}

Index Single Content

const response = await fetch('/Plone/@rag-index', {
    method: 'POST',
    headers: {
        'Accept': 'application/json',
        'Content-Type': 'application/json'
    },
    body: JSON.stringify({
        uid: 'content-uid-here',
        async: true  // default: true
    })
});
const data = await response.json();

Response:

{
  "@id": "http://localhost:8080/Plone/@rag-index",
  "status": "queued",
  "uid": "content-uid-here"
}

Check Status

const response = await fetch('/Plone/@rag-status', {
    headers: { 'Accept': 'application/json' }
});
const data = await response.json();

Response:

{
  "@id": "http://localhost:8080/Plone/@rag-status",
  "enabled": true,
  "index_name": "plone-rag-chunks",
  "exists": true,
  "chunk_count": 1250,
  "parent_count": 150,
  "dimensions": 1024
}