RAG, or Retrieval Augmented Generation, is an architecture that connects a large language model with an external knowledge base, typically a vector database. Instead of relying only on parametric knowledge from training, the model retrieves relevant documents at query time and uses them to ground its response.
The technique was introduced in a 2020 research paper by Lewis et al. at Facebook AI, and by 2026 it had become the dominant enterprise architecture for AI assistants, internal copilots, and domain-specific chatbots.
A typical RAG system has five components: document ingestion and chunking, embedding generation, vector storage (e.g., Pinecone, Weaviate, or Qdrant), similarity search at query time, and prompt construction that injects retrieved passages into the LLM context.
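The ingestion side of that pipeline can be sketched in a few lines of plain Python. This is a toy illustration, not a production recipe: `chunk` uses naive fixed-size word windows where real systems typically split on sentence or token boundaries, and `embed` is a bag-of-words count standing in for a real embedding model's dense vector.

```python
import re

def chunk(text, max_words=50):
    # Naive fixed-size chunking; real pipelines often use
    # sentence- or token-aware splitting with overlap.
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def embed(text):
    # Toy bag-of-words "embedding" standing in for an embedding model.
    counts = {}
    for word in re.findall(r"[a-z]+", text.lower()):
        counts[word] = counts.get(word, 0) + 1
    return counts

# Ingestion: chunk each document, embed each chunk, and keep
# (chunk, vector) pairs in a list -- the role a vector database
# plays at scale.
documents = [
    "Retrieval Augmented Generation grounds LLM answers in retrieved passages."
]
index = [(c, embed(c)) for doc in documents for c in chunk(doc, max_words=5)]
```

At production scale the list comprehension is replaced by writes to a vector database, but the shape of the data (passage text paired with its vector) is the same.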
RAG reduces hallucinations because the model grounds its answer in retrieved source documents (and can cite them), and it enables AI to answer questions about proprietary or recent information that was absent from the base model's training data.
How it works
When a user asks a question, the RAG system converts the query into an embedding, searches a vector database for the most similar passages, and constructs a prompt that includes both the question and retrieved context. The LLM then generates an answer grounded in the provided documents.
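The query-time flow described above can be sketched end to end. Again this is a minimal, self-contained illustration: the word-count `embed` function and the hard-coded passages stand in for a real embedding model and vector database, and `build_prompt` shows one common prompt-construction pattern, not a fixed standard.

```python
import math

def embed(text):
    # Toy word-count embedding standing in for a real embedding model.
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(v * b.get(w, 0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# In-memory stand-in for a vector database: (passage, embedding) pairs.
passages = [
    "RAG retrieves documents at query time to ground the answer.",
    "Vector databases store embeddings for similarity search.",
    "Chunking splits documents into passages before embedding.",
]
store = [(p, embed(p)) for p in passages]

def retrieve(question, k=2):
    # Similarity search: rank stored passages against the query embedding.
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [p for p, _ in ranked[:k]]

def build_prompt(question, k=2):
    # Prompt construction: inject retrieved context ahead of the question.
    context = "\n\n".join(retrieve(question, k))
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")
```

The string returned by `build_prompt` is what gets sent to the LLM; the model then answers from the injected context rather than from memory alone.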
Practical example
A legal firm deploys a RAG system over 10 years of internal case memos. Associates ask questions in natural language and get cited answers with links to source documents. Research time drops from hours to minutes.
Definition by Miss Yera, Leading Woman in Technology in Peru · AI Consultant · Favikon 2025.
Spanish version: /glosario-ia/#what-is-rag