RAG (Retrieval-Augmented Generation)¶
Question-answering system using vector embeddings and hybrid search.
Glossary¶
| Term | Description |
|---|---|
| RAG | Retrieval-Augmented Generation. Technique that enhances LLM responses by first retrieving relevant documents from a knowledge base, then using them as context for answer generation. Reduces hallucinations and grounds answers in actual content. |
| Embedding | A vector (list of numbers) representing text in a high-dimensional space. Similar texts have similar vectors, enabling semantic search. Our embeddings have 1024 dimensions. |
| Vector Search / kNN | k-Nearest Neighbors search. Finds documents whose embedding vectors are closest to the query vector using cosine similarity. Captures semantic meaning beyond keyword matching. |
| BM25 | Best Match 25. Traditional text search algorithm that ranks documents by term frequency and inverse document frequency. Good for exact keyword matches. |
| Hybrid Search | Combines vector search (semantic) with BM25 (keyword) for better results. Vector search finds conceptually similar content; BM25 finds exact term matches. |
| Chunk | A segment of text from a larger document. Long documents are split into overlapping chunks (default: 1000 chars with 200 char overlap) so each chunk fits in the embedding model's context and retrieval is more precise. |
| LLM | Large Language Model. AI model trained on text that can generate human-like responses. Used here to formulate answers based on retrieved context (Ollama/Mistral). |
| Ollama | Local LLM runtime for running open-source models. Used for development with models like `mxbai-embed-large` and `llama3.1:8b`. |
| Cosine Similarity | Measure of similarity between two vectors (1 means identical direction, 0 means orthogonal). Used to find semantically similar text chunks. |
| Dense Vector | Elasticsearch field type for storing embedding vectors. Enables efficient approximate nearest neighbor (ANN) search at scale. |
| Context Window | Maximum amount of text an LLM can process at once. Retrieved chunks are concatenated as context for the LLM, limited by this window size. |
| Top-K | Number of most relevant results to return from search. Higher values provide more context but may include less relevant content. |
| `num_candidates` | Elasticsearch kNN parameter controlling how many candidates to consider before selecting top-K. Higher values improve accuracy but slow down search. |
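As a concrete illustration of the cosine-similarity entry above, here is a minimal, self-contained Python sketch using toy 3-dimensional vectors (real embeddings have 1024 dimensions):

```python
# Minimal illustration of cosine similarity between two embedding vectors.
# Toy 3-dimensional vectors are used here; real embeddings have 1024 dimensions.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.85, 0.05]))  # close to 1.0 -> similar texts
print(cosine_similarity([0.2, 0.9, 0.1], [0.9, 0.0, -0.4]))    # near 0 -> unrelated texts
```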
Architecture¶
+------------------+ +------------------+ +------------------+
| Plone CMS | | Redis/RQ | | Elasticsearch |
| | | Worker | | |
| - REST API |---->| - Embedding |---->| - Chunks Index |
| - Subscribers | | Generation | | - kNN + BM25 |
+------------------+ | - RAG Ask | +------------------+
+------------------+
|
v
+------------------+
| Ollama/Mistral |
| - Embeddings |
| - LLM Chat |
+------------------+
Components¶
REST API Endpoints¶
| Endpoint | Method | Description |
|---|---|---|
| `@rag-ask` | POST | Ask a question (async by default, sync with `"sync": true`) |
| `@rag-ask` | GET | Poll for async result by `job_id` |
| `@rag-status` | GET | RAG system status and statistics |
| `@rag-index` | POST | Manually trigger embedding for single content |
| `@rag-index-all` | POST | Reindex all RAG-enabled content |
Modules¶
- `config.py` - Environment-based configuration and registry access
- `client.py` - OpenAI-compatible API client for embeddings and LLM
- `chunks.py` - Elasticsearch chunks index management
- `index.py` - Content embedding and hybrid search interface
- `tasks.py` - Redis/RQ background tasks for async embedding and question-answering
- `subscribers.py` - Zope event handlers for automatic indexing
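How these modules hand work to each other is described under Data Flow below. As a rough orientation, the subscriber-to-task handoff might look like the hedged sketch here; the handler and task names follow the data-flow description, while the queue name, dotted task path, and signatures are assumptions, not the actual code:

```python
# Hedged sketch of the subscriber -> task wiring. on_content_modified and
# queue_embedding_task follow the data-flow description; the RQ setup in the
# real subscribers.py / tasks.py may differ.
from redis import Redis
from rq import Queue


def queue_embedding_task(uid: str) -> None:
    """Queue a background job that fetches, chunks, embeds, and indexes content."""
    queue = Queue("rag", connection=Redis())     # queue name is an assumption
    queue.enqueue("tasks.embed_content", uid)    # illustrative dotted task path


def on_content_modified(obj, event) -> None:
    """Zope event handler: queue re-embedding when RAG-enabled content changes."""
    queue_embedding_task(obj.UID())
```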
Configuration¶
Environment Variables¶
# Feature flag
RAG_ENABLED=true
# Embedding API (Ollama local / Mistral production)
EMBEDDING_BASE_URL=http://localhost:11434/v1
EMBEDDING_API_KEY=ollama
EMBEDDING_MODEL=mxbai-embed-large
EMBEDDING_DIMENSIONS=1024
# LLM API (Ollama local / Mistral production)
LLM_BASE_URL=http://localhost:11434/v1
LLM_API_KEY=ollama
LLM_MODEL=llama3.1:8b
# Chunking
RAG_CHUNK_SIZE=1000
RAG_CHUNK_OVERLAP=200
# Search
RAG_TOP_K=5
RAG_NUM_CANDIDATES=100
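A minimal sketch of how `config.py` might read these variables (defaults mirror the values above; the actual module may expose them differently):

```python
# Illustrative environment-based configuration, mirroring the variables above.
# The real config.py may structure or name these settings differently.
import os

def get_rag_settings() -> dict:
    return {
        "enabled": os.environ.get("RAG_ENABLED", "false").lower() == "true",
        "embedding_base_url": os.environ.get("EMBEDDING_BASE_URL", "http://localhost:11434/v1"),
        "embedding_model": os.environ.get("EMBEDDING_MODEL", "mxbai-embed-large"),
        "embedding_dimensions": int(os.environ.get("EMBEDDING_DIMENSIONS", "1024")),
        "llm_base_url": os.environ.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        "llm_model": os.environ.get("LLM_MODEL", "llama3.1:8b"),
        "chunk_size": int(os.environ.get("RAG_CHUNK_SIZE", "1000")),
        "chunk_overlap": int(os.environ.get("RAG_CHUNK_OVERLAP", "200")),
        "top_k": int(os.environ.get("RAG_TOP_K", "5")),
        "num_candidates": int(os.environ.get("RAG_NUM_CANDIDATES", "100")),
    }
```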
Plone Registry¶
Content types and LLM behavior are configured via the Plone registry:

| Record | Description |
|---|---|
|  | Content types to index (default: ContentPage, News, File, Contact, Book, Chapter) |
|  | System prompt for the LLM |
|  | Message when no relevant context is found |
|  | Message when answer generation fails |
|  | LLM temperature (0.0 = deterministic, 1.0 = creative) |
|  | Maximum tokens in LLM response |
|  | BM25 text search boost factor |
|  | kNN vector search boost factor |
|  | Additional boost for title matches |
|  | Minimum score for high-confidence results |
|  | Minimum score for medium-confidence results |
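These records are read at runtime via the registry access in `config.py`. A hedged sketch with `plone.api` follows; the record name used is hypothetical, since the actual record IDs are not listed in this document:

```python
# Hedged sketch of reading a registry record with plone.api.
# "my.addon.rag.llm_temperature" is a HYPOTHETICAL record name used only for
# illustration; the real record IDs are defined in the add-on's registry.xml.
from plone import api

temperature = api.portal.get_registry_record(
    "my.addon.rag.llm_temperature",  # hypothetical record name
    default=0.0,
)
```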
Elasticsearch Index¶
A separate index, `{plone-index}-rag-chunks`, is created with the following mapping:
{
"properties": {
"chunk_id": {"type": "keyword"},
"parent_uid": {"type": "keyword"},
"parent_title": {"type": "text"},
"parent_path": {"type": "keyword"},
"portal_type": {"type": "keyword"},
"allowedRolesAndUsers": {"type": "keyword"},
"chunk_index": {"type": "integer"},
"chunk_text": {"type": "text"},
"embedding": {
"type": "dense_vector",
"dims": 1024,
"index": true,
"similarity": "cosine"
}
}
}
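Assuming the official Python Elasticsearch client (8.x), creating the chunks index with this mapping could look like the sketch below; the connection details and the abbreviated field list are illustrative:

```python
# Illustrative sketch: creating the chunks index with the mapping shown above,
# using the official elasticsearch-py 8.x client (connection details assumed).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="plone-rag-chunks",
    mappings={
        "properties": {
            "chunk_id": {"type": "keyword"},
            "parent_uid": {"type": "keyword"},
            "chunk_text": {"type": "text"},
            # ... remaining keyword/text/integer fields as listed above ...
            "embedding": {
                "type": "dense_vector",
                "dims": 1024,
                "index": True,
                "similarity": "cosine",
            },
        }
    },
)
```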
Data Flow¶
Indexing (Async via Worker)¶
1. Content created or modified triggers `on_content_added` / `on_content_modified`.
2. `queue_embedding_task` queues a job with the ES settings and index name.
3. The worker fetches the content via `fetch_data` (REST API).
4. The text is split into overlapping chunks (see the sketch below).
5. Embeddings are generated via the OpenAI-compatible API.
6. The chunks are indexed to Elasticsearch.
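For step 4, a minimal sketch of overlapping character-based chunking with the default 1000/200 settings; the real splitter may work on word or sentence boundaries instead:

```python
# Minimal sketch of overlapping character-based chunking (defaults: 1000/200).
# The actual implementation may split more carefully, e.g. on word boundaries.
def split_into_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = split_into_chunks("some long document text ... " * 200)
print(len(chunks), len(chunks[0]))  # number of chunks, size of the first chunk
```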
Question-Answering (Async via Worker)¶
1. `@rag-ask` (POST) receives the question from the client.
2. A job ID is generated from the question hash + user security context + optional path.
3. If a cached result exists, it is returned immediately.
4. Otherwise, `rag_ask_task` is queued to the Redis worker.
5. A pending status with the `job_id` is returned for polling.
6. The worker performs hybrid search and generates the LLM answer (see the query sketch below).
7. The result is cached in Redis (TTL: 5 minutes).
8. The client polls `@rag-ask` (GET) with the `job_id` until completed.
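For step 6, a hedged sketch of what the hybrid kNN + BM25 request body might look like. Field names follow the chunks mapping above; the boost values and exact query structure are assumptions (the real query is built in `index.py` with the registry-configured boosts):

```python
# Hedged sketch of a hybrid query: BM25 match on chunk_text combined with kNN
# over the embedding field. Boost values and structure are illustrative only.
def build_hybrid_query(question_vector, question_text, user_roles, path=None,
                       top_k=5, num_candidates=100):
    filters = [{"terms": {"allowedRolesAndUsers": user_roles}}]
    if path:
        filters.append({"prefix": {"parent_path": path}})
    return {
        "size": top_k,
        "query": {  # BM25 keyword part
            "bool": {
                "must": [{"match": {"chunk_text": {"query": question_text, "boost": 1.0}}}],
                "filter": filters,
            }
        },
        "knn": {    # semantic vector part, scores are combined with the query above
            "field": "embedding",
            "query_vector": question_vector,
            "k": top_k,
            "num_candidates": num_candidates,
            "boost": 1.0,
            "filter": filters,
        },
    }
```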
Local Development¶
Start Ollama:
ollama serve
ollama pull mxbai-embed-large
ollama pull llama3.1:8b
Enable RAG:
export RAG_ENABLED=true
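With Ollama running, the OpenAI-compatible endpoint can be smoke-tested from Python with the official `openai` client; this is a quick sanity check, not part of the add-on, and the base URL and key match the environment variables above:

```python
# Quick smoke test of the local Ollama OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

emb = client.embeddings.create(model="mxbai-embed-large", input="hello world")
print(len(emb.data[0].embedding))  # expect 1024 dimensions

chat = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Say hi in one word."}],
)
print(chat.choices[0].message.content)
```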
REST API¶
Ask a Question (Async)¶
The `@rag-ask` endpoint processes questions asynchronously by default, allowing the frontend to poll for results without blocking. Results are cached based on the question, user permissions, and optional path filter.
// Submit question (async)
const response = await fetch('/Plone/@rag-ask', {
method: 'POST',
headers: {
'Accept': 'application/json',
'Content-Type': 'application/json'
},
body: JSON.stringify({
question: 'What are the opening hours?',
path: '/plone/section' // optional: filter to specific section
})
});
const data = await response.json();
// data.status === 'pending', data.job_id === 'ragask_abc123'
Pending response:
{
"@id": "http://localhost:8080/Plone/@rag-ask",
"status": "pending",
"job_id": "ragask_abc123",
"question": "What are the opening hours?"
}
Poll for result:
// Poll until completed
const pollResponse = await fetch('/Plone/@rag-ask?job_id=ragask_abc123', {
headers: { 'Accept': 'application/json' }
});
const result = await pollResponse.json();
Completed response:
{
"@id": "http://localhost:8080/Plone/@rag-ask",
"status": "completed",
"question": "What are the opening hours?",
"answer": "Based on the information...",
"sources": [
{
"title": "Contact",
"path": "/contact",
"portal_type": "Contact",
"score": 0.92,
"chunk_index": 0,
"snippet": "Opening hours: Monday to Friday..."
}
]
}
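Putting submit and poll together, an end-to-end loop looks roughly like this (shown as a Python sketch with `requests` for brevity; authentication is omitted and the polling interval is arbitrary):

```python
# Illustrative end-to-end flow: submit a question, then poll until completed.
# Authentication setup is omitted; the 1-second polling interval is arbitrary.
import time
import requests

BASE = "http://localhost:8080/Plone"
HEADERS = {"Accept": "application/json", "Content-Type": "application/json"}

data = requests.post(
    f"{BASE}/@rag-ask",
    json={"question": "What are the opening hours?"},
    headers=HEADERS,
).json()

while data.get("status") == "pending":
    time.sleep(1)
    data = requests.get(
        f"{BASE}/@rag-ask",
        params={"job_id": data["job_id"]},
        headers=HEADERS,
    ).json()

print(data["answer"])
for source in data.get("sources", []):
    print(source["title"], source["score"])
```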
Ask a Question (Sync)¶
Logged-in users can request synchronous processing by adding `"sync": true`. This bypasses the queue and returns the answer directly.
const response = await fetch('/Plone/@rag-ask', {
method: 'POST',
headers: {
'Accept': 'application/json',
'Content-Type': 'application/json'
},
body: JSON.stringify({
question: 'What are the opening hours?',
sync: true
})
});
const data = await response.json();
Sync response:
{
"@id": "http://localhost:8080/Plone/@rag-ask",
"question": "What are the opening hours?",
"answer": "Based on the information...",
"sources": [
{"title": "Contact", "path": "/contact", "score": 0.92, "chunk_index": 0}
]
}
Path Filtering¶
Use the `path` parameter to restrict search results to a specific section of the site:
const response = await fetch('/Plone/@rag-ask', {
method: 'POST',
headers: {
'Accept': 'application/json',
'Content-Type': 'application/json'
},
body: JSON.stringify({
question: 'How do I configure this?',
path: '/plone/documentation/admin-guide'
})
});
This filters results to only include content under the specified path prefix.
Reindex All Content¶
const response = await fetch('/Plone/@rag-index-all', {
method: 'POST',
headers: {
'Accept': 'application/json',
'Content-Type': 'application/json'
}
});
const data = await response.json();
Response:
{
"@id": "http://localhost:8080/Plone/@rag-index-all",
"status": "queued",
"queued_count": 150,
"content_types": ["ContentPage", "News", "File", "Contact", "Book", "Chapter"]
}
Index Single Content¶
const response = await fetch('/Plone/@rag-index', {
method: 'POST',
headers: {
'Accept': 'application/json',
'Content-Type': 'application/json'
},
body: JSON.stringify({
uid: 'content-uid-here',
async: true // default: true
})
});
const data = await response.json();
Response:
{
"@id": "http://localhost:8080/Plone/@rag-index",
"status": "queued",
"uid": "content-uid-here"
}
Check Status¶
const response = await fetch('/Plone/@rag-status', {
headers: { 'Accept': 'application/json' }
});
const data = await response.json();
Response:
{
"@id": "http://localhost:8080/Plone/@rag-status",
"enabled": true,
"index_name": "plone-rag-chunks",
"exists": true,
"chunk_count": 1250,
"parent_count": 150,
"dimensions": 1024
}