RAG-Blog-Content-Retrieval

RAG Blog content retrieval

This is my GitHub project: RAG Blog Content Retrieval.

The following content is copied from the README.md file:

😊Introduction

This is a simple application for blog content retrieval implemented using RAG as the core technology.

🚀Installation

Requirements

  • Make sure you have an OpenAI API key and a LangSmith API key!

Install Python requirements

After cloning the project, run:

pip install -r requirements.txt

Set up .env file

Set up a file named .env in your current directory:

touch .env

Fill in your keys and other information in the .env file. A demo is shown below:

API_KEY=123456
BASE_URL=123456
Langchain_api=123456

Remember to replace the placeholder values with your own keys and base URL!
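A missing or blank entry in .env usually only surfaces later as an authentication error, so it can help to check the file up front. The snippet below is a minimal sketch of such a check, not part of the project: the helper name find_missing_keys is hypothetical, and the key names are the ones from the demo .env above.

```python
import os

# Key names taken from the .env demo above.
REQUIRED_KEYS = ("API_KEY", "BASE_URL", "Langchain_api")


def find_missing_keys(env, required=REQUIRED_KEYS):
    """Return the required names that are absent or blank in `env`.

    `env` is any mapping of names to strings, for example os.environ.
    This helper is hypothetical and only illustrates the check.
    """
    return [name for name in required if not (env.get(name) or "").strip()]


if __name__ == "__main__":
    # Assumes the .env values are already in os.environ, which is what
    # dotenv.load_dotenv() does at the top of main.py.
    missing = find_missing_keys(os.environ)
    if missing:
        print("Missing or empty .env entries:", ", ".join(missing))
    else:
        print("All required .env entries are set.")
```

Calling a check like this right after loading the .env file would let the app stop with a clear message instead of failing on the first API request.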

💓Usage

Make sure you have completed the installation section successfully.

Run Streamlit code locally

Run the following command:

streamlit run ./main.py

You should then see a web page like this:

demo

The app now runs locally on your computer. Enjoy RAG!

A demo:

demo2

You can freely send queries about the blog post and get answers.

Run Streamlit demo online

I will finish it later…

But actually I don’t recommend this, because it is unsafe to enter your secret key on a public website!

🤖Discussion

Just for fun, don’t be serious.

If you have any issues, don’t hesitate to contact the author.

👍Advertisement

My personal Blog: Xiyuan Yang’s Blog

Code

Here is the full code. It can serve as a demonstration for beginners who want to learn bs4, Streamlit, LangChain and RAG!

'''
Author: Xiyuan Yang xiyuan_yang@outlook.com
Date: 2025-03-29 15:17:02
LastEditors: Xiyuan Yang xiyuan_yang@outlook.com
LastEditTime: 2025-03-29 16:26:27
FilePath: /RAG_try/RAG/main.py
Description:
Do you code and make progress today?
Copyright (c) 2025 by Xiyuan Yang, All Rights Reserved.
'''

# Several Requirements
import bs4
import dotenv
import openai
import os
import streamlit as st
from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings



# getting environments done
dotenv.load_dotenv()
openai.api_key = os.getenv("API_KEY")
openai.base_url = os.getenv("BASE_URL")

# langchain api
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
os.environ['LANGCHAIN_API_KEY'] = os.getenv("Langchain_api")

# Define LLMs and prompts
LLM = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,
    api_key=openai.api_key,
    base_url=openai.base_url,
)

prompt = hub.pull("rlm/rag-prompt")

# Load documents
def loader_documents(url):
    '''Download the page at `url` and return it as a list of documents.'''
    loader = WebBaseLoader(
        web_paths=(url,),
    )
    docs = loader.load()
    return docs



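def text_splitter(docs):
    '''Split the documents into 1000-character chunks that overlap by 200.'''
    # The local name differs from the function name to avoid shadowing it.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    splits = splitter.split_documents(docs)

    return splits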

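def embedding(splits):
    '''Embed the chunks and store the vectors in Chroma.'''
    # Use the same API_KEY and BASE_URL entries from .env as the chat model,
    # so the embedding calls go to the same endpoint with the same key.
    vectorstore = Chroma.from_documents(
        documents=splits,
        embedding=OpenAIEmbeddings(
            base_url=os.getenv("BASE_URL"),
            api_key=os.getenv("API_KEY"),
        ),
    )

    retriever = vectorstore.as_retriever()

    return vectorstore, retriever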

# Post-processing
def format_docs(docs):
    '''Join the retrieved chunks into one context string.'''
    return "\n\n".join(doc.page_content for doc in docs)

def get_answer(retriever, prompt, llm, question):
    '''Retrieve context for the question, fill the prompt and query the LLM.'''
    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )

    answer = rag_chain.invoke(question)

    return answer

def getstreamlit_UI(prompt, LLM):

    st.title("RAG-Based Blog System")
    st.subheader("Author: Xiyuan Yang (xiyuanyang-code)")
    st.session_state["Prompt"] = prompt
    st.session_state["LLM"] = LLM

    # Introduction Part
    st.write("## About this website")
    st.write("This website is my **RAG implementation** for searching and retrieving my own blog posts, "
             "you can see my own blog posts and choose the blog url down here!")
    st.write("My Blog posts: [xiyuanyang-code](https://xiyuanyang-code.github.io)")

    # Choose website
    st.write("default url: https://xiyuanyang-code.github.io/posts/Algorithm-BinaryTree/")
    st.write("**Make sure your url is valid!**")
    url = st.text_input("Choose your website: ")

    default_url = "https://xiyuanyang-code.github.io/posts/Algorithm-BinaryTree/"
    if st.button("Scrape Blog Content"):
        with st.spinner("Scraping blog content..."):
            blog_content = loader_documents(url)
            if blog_content:
                st.success("Blog content scraped successfully!")
            else:
                st.error("Failed to scrape blog content.", icon="🚨")
                st.write("Using the default url")
                blog_content = loader_documents(default_url)
            # Store the result (scraped or default) so it survives Streamlit reruns.
            st.session_state["blog_content"] = blog_content

    if "blog_content" in st.session_state:
        blog_content = st.session_state["blog_content"]
        if st.button("Process and Store Content"):
            with st.spinner("Processing and storing content..."):
                splits = text_splitter(blog_content)
                vectorstore, retriever = embedding(splits)
                st.session_state["vectorstore"] = vectorstore
                st.session_state["retriever"] = retriever
                st.success("Content processed and stored in Chroma!")

    if "vectorstore" in st.session_state:
        vectorstore = st.session_state["vectorstore"]
        retriever = st.session_state["retriever"]
        prompt = st.session_state["Prompt"]
        LLM = st.session_state["LLM"]

        query = st.text_input("Ask a question about the blog:")
        if query:
            with st.spinner("Generating answer..."):
                answer = get_answer(retriever=retriever, prompt=prompt, llm=LLM, question=query)
                st.write("**Answer:**")
                st.write(answer)

if __name__ == "__main__":
    getstreamlit_UI(prompt=prompt, LLM=LLM)
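
To see what the split, embed and retrieve steps in main.py are doing without any API key, here is a toy offline sketch of the same flow. It is only an illustration under loud assumptions: fixed character windows stand in for RecursiveCharacterTextSplitter, lowercase word counts stand in for OpenAIEmbeddings, and a sorted list stands in for Chroma. The names split_text, embed, cosine and retrieve are hypothetical and do not come from any of those libraries.

```python
import math
import re
from collections import Counter


def split_text(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into windows of chunk_size chars; neighbours share chunk_overlap chars."""
    step = chunk_size - chunk_overlap
    if step <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


def embed(text):
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(chunks, question, k=2):
    """Return the k chunks most similar to the question."""
    query_vector = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), query_vector), reverse=True)
    return ranked[:k]


def format_docs(chunks):
    """Join the retrieved chunks into one context string, as in main.py."""
    return "\n\n".join(chunks)


if __name__ == "__main__":
    text = (
        "A binary tree is a tree in which every node has at most two children. "
        "Streamlit turns Python scripts into web apps. "
        "Chroma stores embedding vectors for similarity search."
    )
    chunks = split_text(text, chunk_size=80, chunk_overlap=20)
    print(format_docs(retrieve(chunks, "What is a binary tree?", k=1)))
```

In the real app the retrieved context and the question are then passed to the prompt and the LLM; this sketch stops at the context string because the generation step needs a model.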

That is Python! With fewer than 150 lines of code you can build a complete RAG system, generate a neat web UI and implement everything you need! What about C++? Maybe a number-guessing game…

But C++ is very important too. Just for en


RAG-Blog-Content-Retrieval
https://xiyuanyang-code.github.io/posts/RAG-Blog-Content-Retrieval/
Author: Xiyuan Yang
Posted on: March 29, 2025
Updated on: March 29, 2025
Licensed under