RAG (Retrieval-Augmented Generation)
=====================================

Question-answering system using vector embeddings and hybrid search.

Glossary
--------

.. list-table::
   :widths: 20 80
   :header-rows: 1

   * - Term
     - Description
   * - **RAG**
     - Retrieval-Augmented Generation. Technique that enhances LLM responses by first retrieving relevant documents from a knowledge base, then using them as context for answer generation. Reduces hallucinations and grounds answers in actual content.
   * - **Embedding**
     - A vector (list of numbers) representing text in a high-dimensional space. Similar texts have similar vectors, enabling semantic search. Our embeddings have 1024 dimensions.
   * - **Vector Search / kNN**
     - k-Nearest Neighbors search. Finds documents whose embedding vectors are closest to the query vector using cosine similarity. Captures semantic meaning beyond keyword matching.
   * - **BM25**
     - Best Match 25. Traditional text search algorithm that ranks documents by term frequency and inverse document frequency. Good for exact keyword matches.
   * - **Hybrid Search**
     - Combines vector search (semantic) with BM25 (keyword) for better results. Vector search finds conceptually similar content; BM25 finds exact term matches.
   * - **Chunk**
     - A segment of text from a larger document. Long documents are split into overlapping chunks (default: 1000 chars with 200 char overlap) so each chunk fits in the embedding model's context and retrieval is more precise.
   * - **LLM**
     - Large Language Model. AI model trained on text that can generate human-like responses. Used here to formulate answers based on retrieved context (Ollama/Mistral).
   * - **Ollama**
     - Local LLM runtime for running open-source models. Used for development with models like ``llama3.1:8b`` (chat) and ``mxbai-embed-large`` (embeddings).
   * - **Cosine Similarity**
     - Measure of similarity between two vectors, ranging from -1 to 1 (typically 0 to 1 for text embeddings). A value of 1 means identical direction, 0 means orthogonal. Used to find semantically similar text chunks.
   * - **Dense Vector**
     - Elasticsearch field type for storing embedding vectors. Enables efficient approximate nearest neighbor (ANN) search at scale.
   * - **Context Window**
     - Maximum amount of text an LLM can process at once. Retrieved chunks are concatenated as context for the LLM, limited by ``RAG_MAX_CONTEXT_LENGTH``.
   * - **Top-K**
     - Number of most relevant results to return from search. Higher values provide more context but may include less relevant content.
   * - **num_candidates**
     - Elasticsearch kNN parameter controlling how many candidates to consider before selecting top-K. Higher values improve accuracy but slow down search.
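The overlapping-chunk scheme described under **Chunk** can be sketched in a few lines of Python. This is an illustrative sketch only (the actual splitter used by the worker may also respect word or sentence boundaries); the defaults mirror ``RAG_CHUNK_SIZE`` and ``RAG_CHUNK_OVERLAP``:

.. code-block:: python

   def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
       """Split text into overlapping character windows (illustrative sketch)."""
       if chunk_size <= overlap:
           raise ValueError("chunk_size must be larger than overlap")
       chunks = []
       step = chunk_size - overlap
       for start in range(0, len(text), step):
           chunks.append(text[start:start + chunk_size])
           if start + chunk_size >= len(text):
               break
       return chunks

Each chunk repeats the last 200 characters of the previous one, so a sentence that straddles a boundary remains searchable in at least one chunk.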
Architecture
------------

.. code-block:: text

   +------------------+     +------------------+     +------------------+
   |    Plone CMS     |     |     Redis/RQ     |     |  Elasticsearch   |
   |                  |     |      Worker      |     |                  |
   |  - REST API      |---->|  - Embedding     |---->|  - Chunks Index  |
   |  - Subscribers   |     |    Generation    |     |  - kNN + BM25    |
   +------------------+     |  - RAG Ask       |     +------------------+
                            +------------------+
                                     |
                                     v
                            +------------------+
                            |  Ollama/Mistral  |
                            |  - Embeddings    |
                            |  - LLM Chat      |
                            +------------------+

Components
----------

REST API Endpoints
^^^^^^^^^^^^^^^^^^

.. list-table::
   :widths: 25 15 60
   :header-rows: 1

   * - Endpoint
     - Method
     - Description
   * - ``@rag-ask``
     - POST
     - Ask a question (async by default, sync with ``"sync": true`` for logged-in users)
   * - ``@rag-ask``
     - GET
     - Poll for async result by job_id
   * - ``@rag-status``
     - GET
     - RAG system status and statistics
   * - ``@rag-index``
     - POST
     - Manually trigger embedding for single content
   * - ``@rag-index-all``
     - POST
     - Reindex all RAG-enabled content

Modules
^^^^^^^

- **config.py** - Environment-based configuration and registry access
- **client.py** - OpenAI-compatible API client for embeddings and LLM
- **chunks.py** - Elasticsearch chunks index management
- **index.py** - Content embedding and hybrid search interface
- **tasks.py** - Redis/RQ background tasks for async embedding and question-answering
- **subscribers.py** - Zope event handlers for automatic indexing

Configuration
-------------

Environment Variables
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Feature flag
   RAG_ENABLED=true

   # Embedding API (Ollama local / Mistral production)
   EMBEDDING_BASE_URL=http://localhost:11434/v1
   EMBEDDING_API_KEY=ollama
   EMBEDDING_MODEL=mxbai-embed-large
   EMBEDDING_DIMENSIONS=1024

   # LLM API (Ollama local / Mistral production)
   LLM_BASE_URL=http://localhost:11434/v1
   LLM_API_KEY=ollama
   LLM_MODEL=llama3.1:8b

   # Chunking
   RAG_CHUNK_SIZE=1000
   RAG_CHUNK_OVERLAP=200

   # Search
   RAG_TOP_K=5
   RAG_NUM_CANDIDATES=100

Plone Registry
^^^^^^^^^^^^^^

Content types and LLM behavior are configured via the Plone registry:

.. list-table::
   :widths: 40 60
   :header-rows: 1

   * - Record
     - Description
   * - ``wcs.backend.rag_content_types``
     - Content types to index (default: ContentPage, News, File, Contact, Book, Chapter)
   * - ``wcs.backend.rag_system_prompt``
     - System prompt for the LLM
   * - ``wcs.backend.rag_no_answer_message``
     - Message when no relevant context is found
   * - ``wcs.backend.rag_error_message``
     - Message when answer generation fails
   * - ``wcs.backend.rag_llm_temperature``
     - LLM temperature (0.0 = deterministic, 1.0 = creative)
   * - ``wcs.backend.rag_llm_max_tokens``
     - Maximum tokens in LLM response
   * - ``wcs.backend.rag_boost_bm25``
     - BM25 text search boost factor
   * - ``wcs.backend.rag_boost_knn``
     - kNN vector search boost factor
   * - ``wcs.backend.rag_title_boost_factor``
     - Additional boost for title matches
   * - ``wcs.backend.rag_score_high_threshold``
     - Minimum score for high-confidence results
   * - ``wcs.backend.rag_score_medium_threshold``
     - Minimum score for medium-confidence results

Elasticsearch Index
-------------------

A separate index ``{plone-index}-rag-chunks`` is created with the following mapping:

.. code-block:: json

   {
     "properties": {
       "chunk_id": {"type": "keyword"},
       "parent_uid": {"type": "keyword"},
       "parent_title": {"type": "text"},
       "parent_path": {"type": "keyword"},
       "portal_type": {"type": "keyword"},
       "allowedRolesAndUsers": {"type": "keyword"},
       "chunk_index": {"type": "integer"},
       "chunk_text": {"type": "text"},
       "embedding": {
         "type": "dense_vector",
         "dims": 1024,
         "index": true,
         "similarity": "cosine"
       }
     }
   }
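For illustration, a hybrid query against this index combines a ``knn`` clause on ``embedding`` with a BM25 ``match`` on ``chunk_text``, both restricted by the user's ``allowedRolesAndUsers`` and an optional path prefix. The following is a sketch with assumed parameter names and default boosts; the real query is built in ``index.py`` from the registry boost records and the ``RAG_TOP_K``/``RAG_NUM_CANDIDATES`` settings:

.. code-block:: python

   def build_hybrid_query(
       query_vector: list[float],
       question: str,
       allowed_roles: list[str],
       path_prefix: str | None = None,
       top_k: int = 5,
       num_candidates: int = 100,
       boost_bm25: float = 1.0,
       boost_knn: float = 1.0,
   ) -> dict:
       """Sketch of a hybrid (kNN + BM25) Elasticsearch query body.

       Structure and boost values are illustrative; the actual builder in
       index.py also applies the title boost factor from the registry.
       """
       filters = [{"terms": {"allowedRolesAndUsers": allowed_roles}}]
       if path_prefix:
           filters.append({"prefix": {"parent_path": path_prefix}})
       return {
           "size": top_k,
           "knn": {
               "field": "embedding",
               "query_vector": query_vector,
               "k": top_k,
               "num_candidates": num_candidates,
               "boost": boost_knn,
               "filter": filters,
           },
           "query": {
               "bool": {
                   "must": [
                       {"match": {"chunk_text": {"query": question, "boost": boost_bm25}}}
                   ],
                   "filter": filters,
               }
           },
       }

Elasticsearch sums the kNN and BM25 scores for documents matched by both clauses, so the two boost factors control the balance between semantic and keyword relevance.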
Data Flow
---------

Indexing (Async via Worker)
^^^^^^^^^^^^^^^^^^^^^^^^^^^

1. Content created/modified triggers ``on_content_added``/``on_content_modified``
2. ``queue_embedding_task`` queues job with ES settings and index name
3. Worker fetches content via ``fetch_data`` (REST API)
4. Text split into overlapping chunks
5. Embeddings generated via OpenAI-compatible API
6. Chunks indexed to Elasticsearch

Question-Answering (Async via Worker)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

1. ``@rag-ask`` (POST) receives question from client
2. Job ID generated from question hash + user security context + optional path
3. If cached result exists, return immediately
4. Otherwise, queue ``rag_ask_task`` to Redis worker
5. Return pending status with job_id for polling
6. Worker performs hybrid search and generates LLM answer
7. Result cached in Redis (TTL: 5 minutes)
8. Client polls ``@rag-ask`` (GET) with job_id until completed
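Step 2 of the question-answering flow derives the job ID (which doubles as the cache key) from the question, the user's security context, and the optional path filter. A minimal sketch of such a deterministic key; the exact fields and format used by the backend are assumptions:

.. code-block:: python

   import hashlib

   def make_job_id(question: str, user_roles: list[str], path: str | None = None) -> str:
       """Sketch of a deterministic job ID / cache key.

       Identical requests map to the same key, so repeated questions hit the
       Redis cache, while users with different permissions get different keys
       and never share cached answers. Field choice and format are assumptions.
       """
       raw = "|".join([question.strip().lower(), ",".join(sorted(user_roles)), path or ""])
       return "ragask_" + hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]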
Local Development
-----------------

Start Ollama:

.. code-block:: bash

   ollama serve
   ollama pull mxbai-embed-large
   ollama pull llama3.1:8b

Enable RAG:

.. code-block:: bash

   export RAG_ENABLED=true

REST API
--------

Ask a Question (Async)
^^^^^^^^^^^^^^^^^^^^^^

The ``@rag-ask`` endpoint processes questions asynchronously by default, allowing the frontend to poll for results without blocking. Results are cached based on the question, user permissions, and optional path filter.

.. code-block:: javascript

   // Submit question (async)
   const response = await fetch('/Plone/@rag-ask', {
     method: 'POST',
     headers: {
       'Accept': 'application/json',
       'Content-Type': 'application/json'
     },
     body: JSON.stringify({
       question: 'What are the opening hours?',
       path: '/plone/section'  // optional: filter to specific section
     })
   });
   const data = await response.json();
   // data.status === 'pending', data.job_id === 'ragask_abc123'

**Pending response:**

.. code-block:: json

   {
     "@id": "http://localhost:8080/Plone/@rag-ask",
     "status": "pending",
     "job_id": "ragask_abc123",
     "question": "What are the opening hours?"
   }

**Poll for result:**

.. code-block:: javascript

   // Poll until completed
   const pollResponse = await fetch('/Plone/@rag-ask?job_id=ragask_abc123', {
     headers: { 'Accept': 'application/json' }
   });
   const result = await pollResponse.json();

**Completed response:**

.. code-block:: json

   {
     "@id": "http://localhost:8080/Plone/@rag-ask",
     "status": "completed",
     "question": "What are the opening hours?",
     "answer": "Based on the information...",
     "sources": [
       {
         "title": "Contact",
         "path": "/contact",
         "portal_type": "Contact",
         "score": 0.92,
         "chunk_index": 0,
         "snippet": "Opening hours: Monday to Friday..."
       }
     ]
   }

Ask a Question (Sync)
^^^^^^^^^^^^^^^^^^^^^

Logged-in users can request synchronous processing by adding ``"sync": true``. This bypasses the queue and returns the answer directly.

.. code-block:: javascript

   const response = await fetch('/Plone/@rag-ask', {
     method: 'POST',
     headers: {
       'Accept': 'application/json',
       'Content-Type': 'application/json'
     },
     body: JSON.stringify({
       question: 'What are the opening hours?',
       sync: true
     })
   });
   const data = await response.json();

**Sync response:**

.. code-block:: json

   {
     "@id": "http://localhost:8080/Plone/@rag-ask",
     "question": "What are the opening hours?",
     "answer": "Based on the information...",
     "sources": [
       {"title": "Contact", "path": "/contact", "score": 0.92, "chunk_index": 0}
     ]
   }

Path Filtering
^^^^^^^^^^^^^^

Use the ``path`` parameter to restrict search results to a specific section of the site:

.. code-block:: javascript

   const response = await fetch('/Plone/@rag-ask', {
     method: 'POST',
     headers: {
       'Accept': 'application/json',
       'Content-Type': 'application/json'
     },
     body: JSON.stringify({
       question: 'How do I configure this?',
       path: '/plone/documentation/admin-guide'
     })
   });

This filters results to only include content under the specified path prefix.

Reindex All Content
^^^^^^^^^^^^^^^^^^^

.. code-block:: javascript

   const response = await fetch('/Plone/@rag-index-all', {
     method: 'POST',
     headers: {
       'Accept': 'application/json',
       'Content-Type': 'application/json'
     }
   });
   const data = await response.json();

**Response:**

.. code-block:: json

   {
     "@id": "http://localhost:8080/Plone/@rag-index-all",
     "status": "queued",
     "queued_count": 150,
     "content_types": ["ContentPage", "News", "File", "Contact", "Book", "Chapter"]
   }

Index Single Content
^^^^^^^^^^^^^^^^^^^^

.. code-block:: javascript

   const response = await fetch('/Plone/@rag-index', {
     method: 'POST',
     headers: {
       'Accept': 'application/json',
       'Content-Type': 'application/json'
     },
     body: JSON.stringify({
       uid: 'content-uid-here',
       async: true  // default: true
     })
   });
   const data = await response.json();

**Response:**

.. code-block:: json

   {
     "@id": "http://localhost:8080/Plone/@rag-index",
     "status": "queued",
     "uid": "content-uid-here"
   }

Check Status
^^^^^^^^^^^^

.. code-block:: javascript

   const response = await fetch('/Plone/@rag-status', {
     headers: { 'Accept': 'application/json' }
   });
   const data = await response.json();

**Response:**

.. code-block:: json

   {
     "@id": "http://localhost:8080/Plone/@rag-status",
     "enabled": true,
     "index_name": "plone-rag-chunks",
     "exists": true,
     "chunk_count": 1250,
     "parent_count": 150,
     "dimensions": 1024
   }
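For completeness, the submit-and-poll cycle documented above can also be exercised from any HTTP client. A minimal Python sketch using ``requests``; the site URL and polling interval are placeholders, and authentication and error handling are omitted:

.. code-block:: python

   import time

   import requests

   BASE = "http://localhost:8080/Plone"  # placeholder site URL
   HEADERS = {"Accept": "application/json", "Content-Type": "application/json"}

   def ask(question: str, path: str | None = None, timeout: float = 30.0) -> dict:
       """Submit a question to @rag-ask and poll until the answer is ready."""
       payload = {"question": question}
       if path:
           payload["path"] = path
       data = requests.post(f"{BASE}/@rag-ask", json=payload, headers=HEADERS).json()
       if data.get("status") != "pending":
           return data  # cached result returned immediately
       job_id = data["job_id"]
       deadline = time.time() + timeout
       while time.time() < deadline:
           data = requests.get(
               f"{BASE}/@rag-ask", params={"job_id": job_id}, headers=HEADERS
           ).json()
           if data.get("status") == "completed":
               return data
           time.sleep(1)  # assumed polling interval
       raise TimeoutError("RAG answer not ready within timeout")

   print(ask("What are the opening hours?")["answer"])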