RAG (Retrieval-Augmented Generation)
=====================================

Question-answering system using vector embeddings and hybrid search.

Glossary
--------

.. list-table::
   :widths: 20 80
   :header-rows: 1

   * - Term
     - Description
   * - **RAG**
     - Retrieval-Augmented Generation. Technique that enhances LLM responses by first retrieving relevant documents from a knowledge base, then using them as context for answer generation. Reduces hallucinations and grounds answers in actual content.
   * - **Embedding**
     - A vector (list of numbers) representing text in a high-dimensional space. Similar texts have similar vectors, enabling semantic search. Our embeddings have 1024 dimensions.
   * - **Vector Search / kNN**
     - k-Nearest Neighbors search. Finds documents whose embedding vectors are closest to the query vector using cosine similarity. Captures semantic meaning beyond keyword matching.
   * - **BM25**
     - Best Match 25. Traditional text search algorithm that ranks documents by term frequency and inverse document frequency. Good for exact keyword matches.
   * - **Hybrid Search**
     - Combines vector search (semantic) with BM25 (keyword) for better results. Vector search finds conceptually similar content; BM25 finds exact term matches.
   * - **Chunk**
     - A segment of text from a larger document. Long documents are split into overlapping chunks (default: 1000 chars with 200 char overlap) so each chunk fits in the embedding model's context and retrieval is more precise.
   * - **LLM**
     - Large Language Model. AI model trained on text that can generate human-like responses. Used here to formulate answers based on retrieved context (Ollama/Mistral).
   * - **Ollama**
     - Local LLM runtime for running open-source models. Used for development with models like ``llama3.1:8b`` (chat) and ``mxbai-embed-large`` (embeddings).
   * - **Cosine Similarity**
     - Measure of similarity between two vectors, ranging from -1 to 1 (typically 0 to 1 for text embeddings). A value of 1 means identical direction, 0 means orthogonal. Used to find semantically similar text chunks.
   * - **Dense Vector**
     - Elasticsearch field type for storing embedding vectors. Enables efficient approximate nearest neighbor (ANN) search at scale.
   * - **Context Window**
     - Maximum amount of text an LLM can process at once. Retrieved chunks are concatenated as context for the LLM, limited by ``RAG_MAX_CONTEXT_LENGTH``.
   * - **Top-K**
     - Number of most relevant results to return from search. Higher values provide more context but may include less relevant content.
   * - **num_candidates**
     - Elasticsearch kNN parameter controlling how many candidates to consider before selecting top-K. Higher values improve accuracy but slow down search.
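The overlapping-chunk scheme described under **Chunk** can be sketched in a few lines of Python. This is an illustrative sketch only (the actual splitter used by the worker may also respect word or sentence boundaries); the defaults mirror ``RAG_CHUNK_SIZE`` and ``RAG_CHUNK_OVERLAP``:

.. code-block:: python

   def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
       """Split text into overlapping character windows (illustrative sketch)."""
       if chunk_size <= overlap:
           raise ValueError("chunk_size must be larger than overlap")
       chunks = []
       step = chunk_size - overlap
       for start in range(0, len(text), step):
           chunks.append(text[start:start + chunk_size])
           if start + chunk_size >= len(text):
               break
       return chunks

Each chunk repeats the last 200 characters of the previous one, so a sentence that straddles a boundary remains searchable in at least one chunk.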
Architecture
------------

.. code-block:: text

   +------------------+     +------------------+     +------------------+
   |    Plone CMS     |     |     Redis/RQ     |     |  Elasticsearch   |
   |                  |     |      Worker      |     |                  |
   |  - REST API      |---->|  - Embedding     |---->|  - Chunks Index  |
   |  - Subscribers   |     |    Generation    |     |  - kNN + BM25    |
   +------------------+     |  - RAG Ask       |     +------------------+
                            +------------------+
                                     |
                                     v
                            +------------------+
                            |  Ollama/Mistral  |
                            |  - Embeddings    |
                            |  - LLM Chat      |
                            +------------------+

Components
----------

REST API Endpoints
^^^^^^^^^^^^^^^^^^

.. list-table::
   :widths: 25 15 60
   :header-rows: 1

   * - Endpoint
     - Method
     - Description
   * - ``@rag-ask``
     - POST
     - Ask a question (async by default, sync with ``"sync": true`` for logged-in users)
   * - ``@rag-ask``
     - GET
     - Poll for async result by job_id
   * - ``@rag-status``
     - GET
     - RAG system status and statistics
   * - ``@rag-index``
     - POST
     - Manually trigger embedding for single content
   * - ``@rag-index-all``
     - POST
     - Reindex all RAG-enabled content

Modules
^^^^^^^

- **config.py** - Environment-based configuration and registry access
- **client.py** - OpenAI-compatible API client for embeddings and LLM
- **chunks.py** - Elasticsearch chunks index management
- **index.py** - Content embedding and hybrid search interface
- **tasks.py** - Redis/RQ background tasks for async embedding and question-answering
- **subscribers.py** - Zope event handlers for automatic indexing

Configuration
-------------

Environment Variables
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   # Feature flag
   RAG_ENABLED=true

   # Embedding API (Ollama local / Mistral production)
   EMBEDDING_BASE_URL=http://localhost:11434/v1
   EMBEDDING_API_KEY=ollama
   EMBEDDING_MODEL=mxbai-embed-large
   EMBEDDING_DIMENSIONS=1024

   # LLM API (Ollama local / Mistral production)
   LLM_BASE_URL=http://localhost:11434/v1
   LLM_API_KEY=ollama
   LLM_MODEL=llama3.1:8b

   # Chunking
   RAG_CHUNK_SIZE=1000
   RAG_CHUNK_OVERLAP=200

   # Search
   RAG_TOP_K=5
   RAG_NUM_CANDIDATES=100

Plone Registry
^^^^^^^^^^^^^^

Content types and LLM behavior are configured via the Plone registry:

.. list-table::
   :widths: 40 60
   :header-rows: 1

   * - Record
     - Description
   * - ``wcs.backend.rag_content_types``
     - Content types to index (default: ContentPage, News, File, Contact, Book, Chapter)
   * - ``wcs.backend.rag_system_prompt``
     - System prompt for the LLM
   * - ``wcs.backend.rag_no_answer_message``
     - Message when no relevant context is found
   * - ``wcs.backend.rag_error_message``
     - Message when answer generation fails
   * - ``wcs.backend.rag_llm_temperature``
     - LLM temperature (0.0 = deterministic, 1.0 = creative)
   * - ``wcs.backend.rag_llm_max_tokens``
     - Maximum tokens in LLM response
   * - ``wcs.backend.rag_boost_bm25``
     - BM25 text search boost factor
   * - ``wcs.backend.rag_boost_knn``
     - kNN vector search boost factor
   * - ``wcs.backend.rag_title_boost_factor``
     - Additional boost for title matches
   * - ``wcs.backend.rag_score_high_threshold``
     - Minimum score for high-confidence results
   * - ``wcs.backend.rag_score_medium_threshold``
     - Minimum score for medium-confidence results

Elasticsearch Index
-------------------

A separate index ``{plone-index}-rag-chunks`` is created with the following mapping:

.. code-block:: json

   {
     "properties": {
       "chunk_id": {"type": "keyword"},
       "parent_uid": {"type": "keyword"},
       "parent_title": {"type": "text"},
       "parent_path": {"type": "keyword"},
       "portal_type": {"type": "keyword"},
       "allowedRolesAndUsers": {"type": "keyword"},
       "chunk_index": {"type": "integer"},
       "chunk_text": {"type": "text"},
       "embedding": {
         "type": "dense_vector",
         "dims": 1024,
         "index": true,
         "similarity": "cosine"
       }
     }
   }
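For illustration, a hybrid query against this index combines a ``knn`` clause on ``embedding`` with a BM25 ``match`` on ``chunk_text``, both restricted by the user's ``allowedRolesAndUsers`` and an optional path prefix. The following is a sketch with assumed parameter names and default boosts; the real query is built in ``index.py`` from the registry boost records and the ``RAG_TOP_K``/``RAG_NUM_CANDIDATES`` settings:

.. code-block:: python

   def build_hybrid_query(
       query_vector: list[float],
       question: str,
       allowed_roles: list[str],
       path_prefix: str | None = None,
       top_k: int = 5,
       num_candidates: int = 100,
       boost_bm25: float = 1.0,
       boost_knn: float = 1.0,
   ) -> dict:
       """Sketch of a hybrid (kNN + BM25) Elasticsearch query body.

       Structure and boost values are illustrative; the actual builder in
       index.py also applies the title boost factor from the registry.
       """
       filters = [{"terms": {"allowedRolesAndUsers": allowed_roles}}]
       if path_prefix:
           filters.append({"prefix": {"parent_path": path_prefix}})
       return {
           "size": top_k,
           "knn": {
               "field": "embedding",
               "query_vector": query_vector,
               "k": top_k,
               "num_candidates": num_candidates,
               "boost": boost_knn,
               "filter": filters,
           },
           "query": {
               "bool": {
                   "must": [
                       {"match": {"chunk_text": {"query": question, "boost": boost_bm25}}}
                   ],
                   "filter": filters,
               }
           },
       }

Elasticsearch sums the kNN and BM25 scores for documents matched by both clauses, so the two boost factors control the balance between semantic and keyword relevance.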
Data Flow
---------

Indexing (Async via Worker)
^^^^^^^^^^^^^^^^^^^^^^^^^^^

1. Content created/modified triggers ``on_content_added``/``on_content_modified``
2. ``queue_embedding_task`` queues job with ES settings and index name
3. Worker fetches content via ``fetch_data`` (REST API)
4. Text split into overlapping chunks
5. Embeddings generated via OpenAI-compatible API
6. Chunks indexed to Elasticsearch

Question-Answering (Async via Worker)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

1. ``@rag-ask`` (POST) receives question from client
2. Job ID generated from question hash + user security context + optional path
3. If cached result exists, return immediately
4. Otherwise, queue ``rag_ask_task`` to Redis worker
5. Return pending status with job_id for polling
6. Worker performs hybrid search and generates LLM answer
7. Result cached in Redis (TTL: 5 minutes)
8. Client polls ``@rag-ask`` (GET) with job_id until completed
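Step 2 of the question-answering flow derives the job ID (which doubles as the cache key) from the question, the user's security context, and the optional path filter. A minimal sketch of such a deterministic key; the exact fields and format used by the backend are assumptions:

.. code-block:: python

   import hashlib

   def make_job_id(question: str, user_roles: list[str], path: str | None = None) -> str:
       """Sketch of a deterministic job ID / cache key.

       Identical requests map to the same key, so repeated questions hit the
       Redis cache, while users with different permissions get different keys
       and never share cached answers. Field choice and format are assumptions.
       """
       raw = "|".join([question.strip().lower(), ",".join(sorted(user_roles)), path or ""])
       return "ragask_" + hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]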
Local Development
-----------------

Start Ollama:

.. code-block:: bash

   ollama serve
   ollama pull mxbai-embed-large
   ollama pull llama3.1:8b

Enable RAG:

.. code-block:: bash

   export RAG_ENABLED=true

REST API
--------

Ask a Question (Async)
^^^^^^^^^^^^^^^^^^^^^^

The ``@rag-ask`` endpoint processes questions asynchronously by default, allowing the frontend to poll for results without blocking. Results are cached based on the question, user permissions, and optional path filter.

.. code-block:: javascript

   // Submit question (async)
   const response = await fetch('/Plone/@rag-ask', {
     method: 'POST',
     headers: {
       'Accept': 'application/json',
       'Content-Type': 'application/json'
     },
     body: JSON.stringify({
       question: 'What are the opening hours?',
       path: '/plone/section'  // optional: filter to specific section
     })
   });
   const data = await response.json();
   // data.status === 'pending', data.job_id === 'ragask_abc123'

**Pending response:**

.. code-block:: json

   {
     "@id": "http://localhost:8080/Plone/@rag-ask",
     "status": "pending",
     "job_id": "ragask_abc123",
     "question": "What are the opening hours?"
   }

**Poll for result:**

.. code-block:: javascript

   // Poll until completed
   const pollResponse = await fetch('/Plone/@rag-ask?job_id=ragask_abc123', {
     headers: { 'Accept': 'application/json' }
   });
   const result = await pollResponse.json();

**Completed response:**

.. code-block:: json

   {
     "@id": "http://localhost:8080/Plone/@rag-ask",
     "status": "completed",
     "question": "What are the opening hours?",
     "answer": "Based on the information...",
     "sources": [
       {
         "title": "Contact",
         "path": "/contact",
         "portal_type": "Contact",
         "score": 0.92,
         "chunk_index": 0,
         "snippet": "Opening hours: Monday to Friday..."
       }
     ]
   }

Ask a Question (Sync)
^^^^^^^^^^^^^^^^^^^^^

Logged-in users can request synchronous processing by adding ``"sync": true``. This bypasses the queue and returns the answer directly.

.. code-block:: javascript

   const response = await fetch('/Plone/@rag-ask', {
     method: 'POST',
     headers: {
       'Accept': 'application/json',
       'Content-Type': 'application/json'
     },
     body: JSON.stringify({
       question: 'What are the opening hours?',
       sync: true
     })
   });
   const data = await response.json();

**Sync response:**

.. code-block:: json

   {
     "@id": "http://localhost:8080/Plone/@rag-ask",
     "question": "What are the opening hours?",
     "answer": "Based on the information...",
     "sources": [
       {"title": "Contact", "path": "/contact", "score": 0.92, "chunk_index": 0}
     ]
   }

Path Filtering
^^^^^^^^^^^^^^

Use the ``path`` parameter to restrict search results to a specific section of the site:

.. code-block:: javascript

   const response = await fetch('/Plone/@rag-ask', {
     method: 'POST',
     headers: {
       'Accept': 'application/json',
       'Content-Type': 'application/json'
     },
     body: JSON.stringify({
       question: 'How do I configure this?',
       path: '/plone/documentation/admin-guide'
     })
   });

This filters results to only include content under the specified path prefix.

Reindex All Content
^^^^^^^^^^^^^^^^^^^

.. code-block:: javascript

   const response = await fetch('/Plone/@rag-index-all', {
     method: 'POST',
     headers: {
       'Accept': 'application/json',
       'Content-Type': 'application/json'
     }
   });
   const data = await response.json();

**Response:**

.. code-block:: json

   {
     "@id": "http://localhost:8080/Plone/@rag-index-all",
     "status": "queued",
     "queued_count": 150,
     "content_types": ["ContentPage", "News", "File", "Contact", "Book", "Chapter"]
   }

Index Single Content
^^^^^^^^^^^^^^^^^^^^

.. code-block:: javascript

   const response = await fetch('/Plone/@rag-index', {
     method: 'POST',
     headers: {
       'Accept': 'application/json',
       'Content-Type': 'application/json'
     },
     body: JSON.stringify({
       uid: 'content-uid-here',
       async: true  // default: true
     })
   });
   const data = await response.json();

**Response:**

.. code-block:: json

   {
     "@id": "http://localhost:8080/Plone/@rag-index",
     "status": "queued",
     "uid": "content-uid-here"
   }

Check Status
^^^^^^^^^^^^

.. code-block:: javascript

   const response = await fetch('/Plone/@rag-status', {
     headers: { 'Accept': 'application/json' }
   });
   const data = await response.json();

**Response:**

.. code-block:: json

   {
     "@id": "http://localhost:8080/Plone/@rag-status",
     "enabled": true,
     "index_name": "plone-rag-chunks",
     "exists": true,
     "chunk_count": 1250,
     "parent_count": 150,
     "dimensions": 1024
   }
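For completeness, the submit-and-poll cycle documented above can also be exercised from any HTTP client. A minimal Python sketch using ``requests``; the site URL and polling interval are placeholders, and authentication and error handling are omitted:

.. code-block:: python

   import time

   import requests

   BASE = "http://localhost:8080/Plone"  # placeholder site URL
   HEADERS = {"Accept": "application/json", "Content-Type": "application/json"}

   def ask(question: str, path: str | None = None, timeout: float = 30.0) -> dict:
       """Submit a question to @rag-ask and poll until the answer is ready."""
       payload = {"question": question}
       if path:
           payload["path"] = path
       data = requests.post(f"{BASE}/@rag-ask", json=payload, headers=HEADERS).json()
       if data.get("status") != "pending":
           return data  # cached result returned immediately
       job_id = data["job_id"]
       deadline = time.time() + timeout
       while time.time() < deadline:
           data = requests.get(
               f"{BASE}/@rag-ask", params={"job_id": job_id}, headers=HEADERS
           ).json()
           if data.get("status") == "completed":
               return data
           time.sleep(1)  # assumed polling interval
       raise TimeoutError("RAG answer not ready within timeout")

   print(ask("What are the opening hours?")["answer"])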