From Fandom to RAG: Building a Complete Star Trek Assistant in Python

In a previous post I covered how Retrieval Augmented Generation (RAG) can be used to extend a Large Language Model's (LLM) knowledge beyond what it learned during training. That example was deliberately simple and covered only the minimum steps needed for a working RAG pipeline.

In this post, I will take things a step further.

Instead of working with a small dataset, I’ll build a much larger knowledge base by ingesting the complete set of Star Trek information from the massive Memory Alpha Fandom wiki. This site contains tens of thousands of pages covering characters, species, starships, technologies, and episodes across the entire franchise.

Using RAG, we can transform a local model into a specialized Star Trek expert: a digital Data of sorts, capable of handling Star Trek questions by retrieving relevant information directly from the wiki rather than relying only on what the model learned during training, which for a local model will not be sufficient.

RAG Recap

The best way to visualize RAG is to mentally set the LLM aside and think only about the (vector) database storing the knowledge we need. When the user asks a question, we search the database for any information related to the question, then pass the question, a prompt, and the related documents to the LLM to generate the “grounded” response.

Below is the simplest diagram I found to describe this pipeline.

Source: https://deepchecks.com/glossary/agentic-rag/

There is obviously much more behind this, especially regarding sourcing the data (in all kinds of formats), cleaning it, splitting it into chunks, creating embeddings, storing everything in the vector database, building the retriever, and more, but the core principle is the one the diagram above shows.

Getting The Data

The main source of data for this project will be https://memory-alpha.fandom.com/, a collaborative wiki created by fans for everything Star Trek related. It contains 64k articles and 69k files.

Here we can find text in headers, paragraphs, tables, and images, plus all sorts of other elements for which we need an extraction solution, or at least a clear idea of what we need from the site.

This contains all the information we need for this RAG. However, it is an HTML source, so it will mix headers, paragraphs, tables, images, captions, hyperlinks, file names, and surely plenty of other data we have no use for.

Attempt 1: Scraping the Wiki

My first idea was to use a well-known Python package, BeautifulSoup, to scrape the data, but I found a better approach: the MediaWiki API provided by the site itself, which gives direct access to the page content and thus avoids some of the messy HTML.

The following code defines the API endpoint, an output file, and a parallel worker setup to speed up the extraction process.

...

API = "https://memory-alpha.fandom.com/api.php"
OUTPUT_FILE = "memory_alpha.jsonl"

MAX_WORKERS = 20

def get_all_pages():
    pages = []
    params = {
        "action": "query",
        "list": "allpages",
        "aplimit": "500",
        "format": "json"
    }
...

With this I can retrieve all pages in the wiki and save the extracted content into a JSONL file.
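For reference, a minimal version of the page-listing loop might look like the sketch below. It uses the requests package and follows the continue token that the MediaWiki API returns when more pages are available; the function name matches the fragment above, but the body is an illustrative reconstruction, not the exact code used.

```python
import requests

API = "https://memory-alpha.fandom.com/api.php"

def get_all_pages():
    """Yield all page titles, following the MediaWiki 'continue' token."""
    params = {
        "action": "query",
        "list": "allpages",
        "aplimit": "500",
        "format": "json",
    }
    while True:
        resp = requests.get(API, params=params, timeout=30).json()
        for page in resp["query"]["allpages"]:
            yield page["title"]
        # The API includes a 'continue' block until the listing is exhausted
        if "continue" not in resp:
            break
        params.update(resp["continue"])
```

Each title yielded here can then be fetched individually (in parallel, with the worker pool defined earlier) and appended to the JSONL file.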

After the full run, the result was a ~200 MB JSONL file containing the raw content of the entire wiki.

On my M3 Mac, the process took roughly 20–30 minutes to complete.

The Data Quality Problem

I had all the data I needed, but it was messy – just eyeballing the JSONL file revealed plenty of pollution. My experience in data integration has taught me one thing: Garbage In, Garbage Out. A RAG system is only as good as its knowledge base.

Properly cleaning this would require more effort than I was prepared to spend, but luckily I found a clean dataset on Hugging Face for Memory Alpha. The only apparent downside is the cutoff date of 2023, but I can live with that 🙂

The only missing dependency was a package that allows us to download public datasets.

pip3 install datasets

Loading it was fairly easy with the lines below.

from datasets import load_dataset

# Load the entire dataset 
dataset = load_dataset("emergentorder/StarTrekMemoryAlpha20230216")

# Access the training split
train_data = dataset["train"]

# Look at the first entry
print(train_data[0])

The data looks good, however it still has plenty of punctuation marks and escape characters that we need to parse out.

Cleaning and Chunking

Before generating embeddings, the dataset needs to go through two important pre-processing steps: cleaning and then chunking. Even when using a relatively clean dataset (like this one from Hugging Face), there are always improvements to be made.

In this case, cleaning will focus on removing pieces of data that are not relevant for our RAG, like punctuation marks, escape characters and things like external links, references, redirections etc.

The goal here is simply to extract the meaningful text content that will later be used for semantic search. Here is an example of the type of code I will use:

    # Remove line breaks
    text = re.sub(r"\n+", " ", text)
    # Collapse repeated whitespace
    text = re.sub(r"\s{2,}", " ", text)
    return text.strip()
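Putting those pieces together, a minimal cleaning function could look like the sketch below. The removal patterns for links and file references are illustrative assumptions, not the exact ones used.

```python
import re

def clean_text(text: str) -> str:
    """Strip wiki leftovers and normalize whitespace."""
    # Remove external links and wiki file references (illustrative patterns)
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"\[\[File:[^\]]*\]\]", " ", text)
    # Remove line breaks
    text = re.sub(r"\n+", " ", text)
    # Collapse repeated whitespace
    text = re.sub(r"\s{2,}", " ", text)
    return text.strip()
```

Applied to every record in the dataset, this leaves only plain prose behind for the chunking step.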

At the end we need to create chunks, splitting the data into more manageable pieces (with some overlap so we don't miss anything at the boundaries).

Embedding an entire wiki page at once would not work, because the model's context window is too small, and we would not be able to feed properly targeted content to the LLM – just the entire document, which would be too generic.

Instead, we split each document into smaller pieces that preserve semantic meaning while staying within model limits.

At this point we have a new file, memory_alpha_clean_chunks.jsonl, with one cleaned chunk per line.

To start with, I will create chunks of 300 words with a chunk_overlap of 50.
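As a sketch, word-based chunking with overlap can be implemented along these lines (chunk_words is a hypothetical helper, not the exact code used):

```python
def chunk_words(text: str, chunk_size: int = 300, overlap: int = 50):
    """Split text into word-based chunks, overlapping neighbouring chunks."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

With 300-word chunks and a 50-word overlap, each chunk repeats the last 50 words of its predecessor, so sentences that straddle a boundary still appear intact somewhere.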

Preparing For Embeddings

The embedding step picks up the generated chunks and turns the text into numbers so it can be represented in vector space, which in turn allows us to take the text of a new user input and directly measure its proximity and semantic similarity to the existing vector data.
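As a toy illustration of what "proximity" means here, the similarity between two embedding vectors is typically measured with cosine similarity:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (1 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

The vector database performs essentially this comparison (at scale, with indexing) between the query vector and every stored chunk vector.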

We also need to choose an embedding model, and I will use nomic-embed-text, for no other reason than that it was the one suggested to me by Claude 🙂

import json
import ollama
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("memory_alpha")

with open("memory_alpha_clean_chunks.jsonl") as f:
    for i, line in enumerate(f):
        data = json.loads(line)

        text = data["text"]
        title = data["title"]

        embedding = ollama.embeddings(
            model="nomic-embed-text",
            prompt=text
        )["embedding"]

        collection.add(
            ids=[str(i)],
            embeddings=[embedding],
            documents=[text],
            metadatas=[{"title": title}]
        )

This took quite some time to complete – and it threw an error whenever a chunk was larger than the embedding model's context window. The code below (albeit not a clean solution) solved the issue by simply skipping those chunks.

        try:
            embedding = ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]
        except Exception as e:
            print("Skipping chunk", i, e)
            continue

Retriever Mechanism

When a user asks a question, the Retriever finds the most relevant “chunks” from the Vector DB based on semantic similarity.

import chromadb
import ollama

client = chromadb.PersistentClient(path="./chroma_db")

collection = client.get_collection("memory_alpha")

# Embed the query with the same model used for the documents,
# otherwise the vector dimensions will not match
query_embedding = ollama.embeddings(
    model="nomic-embed-text",
    prompt="What is the USS Enterprise"
)["embedding"]

results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5
)

for doc in results["documents"][0]:
    print(doc[:300])

Then the output from this semantic search (which will be a max of 5 results) is combined with the RAG prompt and sent to the LLM.

context = "\n\n".join(results["documents"][0])

prompt = f"""
Answer the question using the context below.

Context:
{context}

Question:
What is the USS Enterprise?

Answer:
"""

This is what we pass to the LLM – the retrieved documents plus the instructions – so that it can generate a grounded response.
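To make this reusable for arbitrary questions, the prompt assembly can be wrapped in a small helper (build_prompt is a hypothetical name for illustration):

```python
def build_prompt(context: str, question: str) -> str:
    """Combine retrieved context and the user question into the RAG prompt."""
    return f"""
Answer the question using the context below.

Context:
{context}

Question:
{question}

Answer:
"""
```

The same function then serves every user question coming from the UI later on.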

Below is the final request to llama3.1.


response = ollama.chat(
    model="llama3.1",
    messages=[
        {"role": "user", "content": prompt}
    ]
)

print(response["message"]["content"])

Disappointing Results

The outcome of this first execution was nothing short of disappointing.

Note that I first print out the retrieved chunks, and none of them had any meaningful information about the USS Enterprise.

This means I need to revisit the chunk/split step. There is a lot more going on here, with many decisions regarding what to include in the chunks (for example, the title at the beginning to help categorize the chunk), the chunk size, and removing anything that is not clean text. This is something to focus on in a later post.

Second Iteration

I tried a new embedding model – mxbai-embed-large – reduced the chunk size, and included a minimum chunk size as well (which removes small or empty chunks). This was enough to start producing better results.
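The minimum-size filter can be as simple as the sketch below (filter_chunks and the 30-word threshold are illustrative assumptions, not the exact values used):

```python
def filter_chunks(chunks, min_words: int = 30):
    """Drop chunks too short to carry useful meaning (nav stubs, empty rows)."""
    return [c for c in chunks if len(c.split()) >= min_words]
```

Tiny chunks are disproportionately harmful: they embed to vectors that match many queries superficially while containing no usable context, so filtering them out directly improves retrieval quality.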

As said before, there are many factors to consider when deciding on an architecture for a RAG. I am jumping through hoops on all of those at the moment, as the goal is to complete the pipeline quickly.

Adding a UI

The last piece is to add a simple UI for a better user experience. Given that I am not a particularly big fan of frontend development, I used Claude to assist me with this step, and in just a couple of minutes, with the help of Streamlit, I could launch a supporting UI for the LLM + RAG pipeline.

Final Notes

The idea of RAG is quite straightforward: ingest data, create embeddings, store them in a vector database, and retrieve relevant chunks to augment the LLM’s responses. In practice, however, the quality of the results depends heavily on the quality of the data preparation steps, and of the architecture decisions along the way.

In this exercise, the first implementation technically worked, but the answers were poor because the retrieved context was not actually relevant to the question (in fact, it was not relevant to any question).

This highlighted one of the most important lessons when building RAG systems – retrieval quality is everything. If the retriever fails to surface the right context, the LLM will struggle to provide a good answer, especially given that I am using local LLMs, which are good but not the most capable.

Improving the chunking strategy, cleaning the data more carefully, and including additional metadata such as the page title improved the relevance of the retrieved results.

Going forward, I can see myself exploring the retrieval mechanism, the prompt strategy, and the embedding strategy in more depth.
