RAG-Tutorial

Learn RAG from scratch!

Lecture Notes for This YouTube Class.[1]

Basic Knowledge

Retrieval-Augmented Generation (RAG) is a hybrid framework that combines the strengths of retrieval-based and generative models for natural language processing tasks. The basic principle of RAG involves two key components: a retriever and a generator.

The retriever is responsible for fetching relevant documents or information from a large corpus or knowledge base, typically using dense vector representations to find the most pertinent data for a given query. This ensures that the model has access to accurate, up-to-date external information beyond its training data.

The generator, usually a pre-trained language model, then uses this retrieved information to produce coherent and contextually appropriate responses. By grounding its outputs in retrieved facts, RAG reduces hallucinations and improves factual accuracy compared to purely generative models.
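The overall flow can be summarized in a short, framework-agnostic sketch. Note that retriever.search and generator.generate below are hypothetical placeholder interfaces used only to illustrate the two components, not any specific library's API:

# A conceptual, framework-agnostic sketch of the RAG flow.
# retriever.search and generator.generate are hypothetical placeholders.
def rag_answer(query, retriever, generator, top_k: int = 4) -> str:
    # 1. Retriever: fetch the most relevant documents for the query,
    #    typically via dense vector similarity search.
    docs = retriever.search(query, top_k=top_k)
    # 2. Generator: condition the language model on the query plus the
    #    retrieved context so the answer is grounded in external facts.
    context = "\n\n".join(doc.text for doc in docs)
    return generator.generate(f"Context:\n{context}\n\nQuestion: {query}")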

Basic Steps:

  • Load the documents with a web crawler (e.g., WebBaseLoader, which uses bs4 to parse the pages).

  • Split the text with RecursiveCharacterTextSplitter from langchain.text_splitter.

  • Embed the splits into vectors using Chroma from langchain_community.vectorstores together with OpenAIEmbeddings, and return the vector store's retriever (a sketch of these indexing steps follows this list).

  • Define the RAG chain (including the prompt and the LLM):

    # Chain
    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
  • Use the invoke function to get the generated text!
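Below is a minimal sketch of the first three steps (loading, splitting, and indexing). The URL and the bs4 filter are illustrative assumptions, chosen only to match the agent-related questions used later in this post; format_docs is the small helper referenced in the chain above.

import bs4
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 1. Load: crawl the page and parse it with bs4 (URL and filter are examples)
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(class_=("post-content", "post-title", "post-header"))
    ),
)
docs = loader.load()

# 2. Split: chunk the documents into overlapping pieces
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

# 3. Embed + index: store the chunks in Chroma and expose a retriever
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

# Helper used by rag_chain above: concatenate retrieved docs into one context string
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)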

Query-Translation

Multi Query

Intuition: For a broad or complex question, a single phrasing often fails to capture the user's intent ("words fail to convey meaning"), and if the answer is relatively complex, a simple RAG system that relies on one direct vector comparison may retrieve poor results. We can therefore rewrite the question from multiple perspectives and retrieve in parallel.

How do we generate these alternative queries? Use prompt engineering with a template:

from langchain.prompts import ChatPromptTemplate

# Multi Query: Different Perspectives
template = """You are an AI language model assistant. Your task is to generate five
different versions of the given user question to retrieve relevant documents from a vector
database. By generating multiple perspectives on the user question, your goal is to help
the user overcome some of the limitations of the distance-based similarity search.
Provide these alternative questions separated by newlines. Original question: {question}"""
prompt_perspectives = ChatPromptTemplate.from_template(template)

from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

generate_queries = (
    prompt_perspectives
    | ChatOpenAI(temperature=0,
                 api_key=openai.api_key,
                 base_url=BaseUrl)
    | StrOutputParser()
    | (lambda x: x.split("\n"))
)

In this code, we define the generate_queries chain, which prompts an OpenAI model to rewrite the original question into several alternative queries.
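For example, invoking the chain with the question used later in this post looks like the following (the exact generated wording will vary between runs):

# Ask the chain to rewrite a sample question into multiple queries
question = "What is task decomposition for LLM agents?"
queries = generate_queries.invoke({"question": question})
for q in queries:
    print(q)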

Here is a sample of the queries the model generates:

1. How do LLM agents utilize task decomposition in their operations?
2. Can you explain the concept of task decomposition as applied to LLM agents?
3. What role does task decomposition play in the functioning of LLM agents?
4. How is task decomposition integrated into the workflow of LLM agents?
5. In what way does task decomposition enhance the capabilities of LLM agents?
from langchain.load import dumps, loads

def get_unique_union(documents: list[list]):
    """Unique union of retrieved docs."""
    # Flatten list of lists, and convert each Document to string
    flattened_docs = [dumps(doc) for sublist in documents for doc in sublist]
    # Get unique documents
    unique_docs = list(set(flattened_docs))
    # Return deserialized Documents
    return [loads(doc) for doc in unique_docs]

# Retrieve
question = "What is task decomposition for LLM agents?"
retrieval_chain = generate_queries | retriever.map() | get_unique_union
docs = retrieval_chain.invoke({"question": question})
len(docs)

print(docs)

The core retrieval chain is retrieval_chain = generate_queries | retriever.map() | get_unique_union: the input question is first rewritten into several queries by generate_queries, then the queries are mapped onto the retriever (for every rewritten query, the most similar document vectors are fetched), and finally get_unique_union deduplicates the retrieved documents.
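Conceptually, retriever.map() just runs the retriever once per generated query. Assuming a recent LangChain version where retrievers support invoke, the chain behaves roughly like this unrolled sketch (for illustration only):

# Unrolled, illustrative equivalent of:
#   generate_queries | retriever.map() | get_unique_union
def multi_query_retrieve(question: str):
    queries = generate_queries.invoke({"question": question})  # rewrite the question
    doc_lists = [retriever.invoke(q) for q in queries]         # retrieve per rewritten query
    return get_unique_union(doc_lists)                         # deduplicate the documents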

from operator import itemgetter
from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnablePassthrough

# RAG
template = """Answer the following question based on this context:

{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

llm = ChatOpenAI(temperature=0,
                 api_key=openai.api_key,
                 base_url=BaseUrl)

final_rag_chain = (
    {"context": retrieval_chain, "question": itemgetter("question")}
    | prompt
    | llm
    | StrOutputParser()
)

final_rag_chain.invoke({"question": question})

Finally, it is RAG time! The final_rag_chain looks as follows:

final_rag_chain = (
    {"context": retrieval_chain, "question": itemgetter("question")}
    | prompt
    | llm
    | StrOutputParser()
)

We first use the retrieval_chain defined above to fetch the relevant texts from the documents, and use itemgetter to pass through the original question. Then we feed both into the LLM with the predefined prompt. Finally, StrOutputParser() extracts the LLM's answer as a string.

RAG-Fusion

References

