Retrieval-Augmented Generation (RAG) was introduced in a 2020 research paper and addressed one of the early limitations of LLMs: there was no fast, inexpensive way to extend their knowledge base. Until then, extending an LLM's knowledge required fine-tuning and extra training, an expensive and lengthy process. And since these models were shared ones, companies could not realistically add their own data to the training.
RAG addresses those limitations by “appending” a new set of data from external sources that the LLM can access without it being part of the training corpus. A win-win situation, as those external data sources can be managed easily without impacting the LLM.
The pace at which LLMs are advancing makes RAG seem old, but it is actually (in my opinion) still the solution with the best return (price/time) for businesses to get real value without losing control of their data. RAG is how you turn a generic, smart model into a specific company employee or assistant – funny how we stopped using the word chatbot.
The current state
I will be using a local model (Llama 3.1) with Ollama for this exercise. It is a highly capable model for conversation (I have the 8B instruct variant), but it will obviously fail on many facts and details, which we will cover with RAG.
If I ask about the FC Porto squad that won the Champions League in 2004 against Monaco, the results are nothing short of hallucination.

And every time I run the same question the answer is different, but always surfing the hallucination wave.

The ultimate proof of this model's failure? It named as ‘Man of the Match’ a player who never even played for Porto.

Solution Diagram
We’ll build a simple local RAG pipeline using:
- Ollama → runs the LLM locally
- Llama 3.1 → generation model
- LangChain → orchestration
- Chroma → vector store
Below is an LLM-generated diagram.

Step 1 – Install Dependencies
pip3 install langchain langchain-community langchain-ollama chromadb beautifulsoup4 requests
*I have Python 3, hence the pip3
We also need (on top of llama3.1) the nomic-embed-text model, which will allow us to convert text into embedding vectors (more on this later)
ollama pull llama3.1
ollama pull nomic-embed-text
Step 2 — Load the data
Wikipedia has a page with all the information necessary to address the gap the model shows:
https://en.wikipedia.org/wiki/2004_UEFA_Champions_League_final
import os
import requests
import bs4

os.environ["USER_AGENT"] = "Mozilla/5.0"

# ── 1. Scrape Wikipedia ───────────────────────────────────────────────
print("Loading Wikipedia page...")
response = requests.get(
    "https://en.wikipedia.org/wiki/2004_UEFA_Champions_League_final",
    headers={"User-Agent": "Mozilla/5.0"},
)
soup = bs4.BeautifulSoup(response.text, "html.parser")

# Drop footnote markers, inline styles and scripts
for tag in soup.find_all(["sup", "style", "script"]):
    tag.decompose()

article = soup.select_one("#mw-content-text .mw-parser-output")
content = []
for el in article.find_all(["h2", "h3", "p", "li"]):
    content.append(el.get_text(strip=True))
text = "\n".join(content)
print(text[:2000])
Basically we are scraping the entire HTML page, looking for paragraphs, lists, and headings, and collecting all of them. The last line helps us validate that we are getting data, so you should see something like the screenshot below.

Step 3 — Split the document into chunks
In RAG the page is not embedded whole at once. Instead it's split into smaller chunks so the retriever can find the most relevant pieces. We'll use LangChain's text splitter.
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Wrap text as a Document
doc = Document(page_content=text)

# Create splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
)

# Split into chunks
chunks = text_splitter.split_documents([doc])
print(len(chunks))
print(chunks[0].page_content)
The chunk size is one of the important configuration parameters we can control. A smaller size improves retrieval precision, but it may dilute the context associated with a fact. This is not impactful for this small exercise.
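To make the trade-off concrete, here is a minimal, dependency-free sketch of character-based chunking with overlap. It is a simplification of what RecursiveCharacterTextSplitter does (the real splitter also tries to break on paragraph and sentence boundaries), but it shows how chunk size controls how many, and how fine-grained, the pieces are:

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive sliding-window chunker: fixed-size windows that overlap."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "The 2004 UEFA Champions League final was played in Gelsenkirchen. " * 20

small = chunk_text(sample, chunk_size=100, chunk_overlap=20)
large = chunk_text(sample, chunk_size=800, chunk_overlap=100)

# Smaller chunks -> many more, finer-grained pieces for the retriever to rank
print(len(small), len(large))
```

With small chunks the retriever can pinpoint a single fact, but each piece carries less surrounding context; large chunks keep context together at the cost of precision.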
Step 4 — Create embeddings
We will generate embeddings using nomic-embed-text through Ollama.
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")
Embedding is the process of converting text into numbers (or vectors), where that numeric representation can be used to map meaning proximity. This is a concept we used to cover in the prehistoric era of NLP chatbots, where we created the diagram below to visually explain embeddings.
You can use these extra code lines to verify that the embeddings did work.
vector = embeddings.embed_query("The final was played in Gelsenkirchen")
print("Vector length:", len(vector))
print("First 10 values:", vector[:10])
After this step we can represent text numerically.
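“Meaning proximity” between those vectors is usually measured with cosine similarity. Here is a toy sketch with hand-made 3-dimensional vectors (real embeddings have hundreds of dimensions, and the values below are invented for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors: football-related texts point in a similar direction
porto = [0.9, 0.1, 0.2]   # "FC Porto won the final"
monaco = [0.8, 0.2, 0.3]  # "Monaco lost the final"
recipe = [0.1, 0.9, 0.7]  # "How to bake a cake"

print(cosine_similarity(porto, monaco))  # close to 1.0
print(cosine_similarity(porto, recipe))  # much lower
```

The two football sentences land close together in vector space while the cake recipe lands far away – that is exactly the property the retriever exploits.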

Step 5 — Store chunks in a vector database
I will use Chroma, a lightweight vector database that runs locally and works well with LangChain.
The vector database will store: the chunk text, the embedding vector, and an ID.
from langchain_community.vectorstores import Chroma

# Embed the chunks and store them (text + vector + ID) in a local database
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)
I realized at this point that I was missing the chromadb package, so I had to install it with
pip3 install chromadb
Step 6 — Create the retriever
The retriever searches the vector database for the most relevant chunks.
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3},
)
print("Retriever ready")
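Under the hood, similarity search is conceptually simple: embed the query, score it against every stored vector, and return the top k. A dependency-free sketch of that idea, with invented toy vectors (Chroma does this far more efficiently, with proper indexing):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_k(query_vec: list[float], store: list[tuple], k: int = 3) -> list[tuple]:
    """Rank stored (id, vector, text) entries by similarity to the query."""
    scored = [(cosine(query_vec, vec), cid, text) for cid, vec, text in store]
    return sorted(scored, reverse=True)[:k]

# Toy store: (id, embedding vector, chunk text)
store = [
    ("c1", [0.9, 0.1], "Porto beat Monaco 3-0 in the 2004 final"),
    ("c2", [0.2, 0.9], "The stadium in Gelsenkirchen opened in 2001"),
    ("c3", [0.7, 0.5], "Deco was named man of the match"),
]

query = [0.85, 0.2]  # pretend embedding of "Who won the 2004 final?"
for score, cid, text in top_k(query, store, k=2):
    print(f"{score:.3f} {cid}: {text}")
```

This is all `search_kwargs={"k": 3}` controls: how many of the best-scoring chunks get handed to the LLM.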
Step 7 — Ask a question
Ok, let's plug it all in and ask a direct question. Note that we now retrieve the most relevant chunks first and pass them to the model as context.
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="llama3.1",
    temperature=0,
)

# Retrieve the most relevant chunks and pass them as context
question = "Describe the match in the Champions League final in 2004"
docs = retriever.invoke(question)
context = "\n\n".join(doc.page_content for doc in docs)

response = llm.invoke(
    f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
)
print(response.content)

What a night-and-day difference from the previously generated answer. This is based on the data that we keep with RAG – quite an impressive improvement for such little effort/cost.
