In a previous post, I built a basic Retrieval Augmented Generation (RAG) pipeline using the Star Trek Memory Alpha wiki as a knowledge base. The system worked: a question came in, relevant chunks were retrieved from a vector database, and a local LLM generated an answer grounded in that retrieved context.
However, the quality of the answers was inconsistent.
Sometimes the retriever surfaced good-quality information and the model assembled excellent answers. Other times it returned unrelated text chunks that caused the model to generate incomplete or inaccurate responses.

This highlighted an important lesson when building RAG systems:
Answer quality is limited by retrieval quality.
Improving the retriever stage is an effective way to increase the accuracy and relevance of a RAG system.
In this post, I will explore some advanced retrieval techniques (collected from research and this book) and how they can improve a RAG system built on top of the Memory Alpha dataset from the Star Trek universe.
Why Retrieval Matters in RAG
A RAG pipeline typically looks like this: the user's question is embedded, the most similar chunks are retrieved from a vector database, and the LLM generates an answer using those chunks as context.

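In code, that loop is only a few lines. Here is a minimal sketch, assuming a Chroma collection that already holds the embedded wiki chunks and an `llm` callable wrapping the local model (both names are placeholders, not the exact code from the previous post):

```python
import chromadb

# Placeholder names: a persisted Chroma collection with the embedded wiki chunks.
client = chromadb.PersistentClient(path="memory_alpha_db")
collection = client.get_collection("memory_alpha")

def answer(question: str, llm) -> str:
    # Retrieve the chunks most similar to the question.
    results = collection.query(query_texts=[question], n_results=5)
    context = "\n\n".join(results["documents"][0])

    # Ask the local model to answer using only the retrieved context.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)
```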
The LLM itself does not inherently know the answer. Instead, it depends on the retrieved documents to supply the relevant context. For enterprise data this is obvious, but for more general topics, like Star Trek, even a local model may already have some knowledge of its own.
If the retriever returns the wrong documents, the model has no chance of producing a good answer.
For example, if a user asks: What powers does Q have?* and the retriever returns generic or unrelated data about Q that does not explicitly answer the question, the model will struggle to answer accurately.
*Yeah this requires some Star Trek knowledge to understand the question 🙂
Advanced Retrieval Techniques
Below are several techniques that can significantly improve the quality of retrieved context.
Metadata Filtering
Metadata filtering narrows the search space by applying constraints to the vector query.
When documents are stored in a vector database, they can include metadata such as: article title, series, year, etc.
Instead of searching the entire database, we can restrict queries to documents matching certain metadata attributes.
Example Query: What happened to the Enterprise in The Next Generation?
Metadata filter: series = “The Next Generation”
This prevents the system from retrieving documents related to other Star Trek series.
This is all great on paper, but my RAG only has one metadata field, Title, so while it can still be useful, it is not going to be a massive improvement, and it is something I already had from the earlier implementation.
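For completeness, here is what a filtered query looks like as a sketch, reusing the `collection` object from the earlier snippet. The `series` field is hypothetical; my own chunks only carry `title`:

```python
# Same query as before, but restricted to chunks whose metadata matches the filter.
# NOTE: the "series" field is assumed for illustration; my collection only stores "title".
results = collection.query(
    query_texts=["What happened to the Enterprise?"],
    n_results=5,
    where={"series": "The Next Generation"},
)
```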
Multi-Query Retrieval
Vector search depends heavily on wording, so even a small variation in phrasing can produce different embeddings and, therefore, different results.
Multi-query retrieval mitigates this by generating several alternative phrasings of the same question.
Example: Who created the Borg?
Alternative queries might include:
- What is the origin of the Borg?
- How were the Borg created?
- What species founded the Borg collective?
Each query performs a separate search, and the resulting documents/chunks are merged.
This technique helps ensure that important documents are not missed just because the wording differs. On the downside, it adds latency because of the extra LLM call needed to generate those alternative queries.
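A rough sketch of the idea, with the prompt wording and the `llm` callable as placeholder assumptions:

```python
def multi_query_retrieve(question: str, llm, collection, n_results: int = 5):
    # Ask the LLM for alternative phrasings of the same question.
    prompt = (
        "Rewrite the following question in 3 different ways, one per line:\n"
        f"{question}"
    )
    alternatives = [q.strip() for q in llm(prompt).splitlines() if q.strip()]

    # Run one vector search per phrasing (plus the original) and merge by chunk id.
    merged = {}
    for query in [question] + alternatives:
        results = collection.query(query_texts=[query], n_results=n_results)
        for doc_id, doc in zip(results["ids"][0], results["documents"][0]):
            merged[doc_id] = doc
    return list(merged.values())
```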
HyDE (Hypothetical Document Embeddings)
HyDE improves retrieval by generating a hypothetical answer before performing the search.
Instead of embedding the original question, the system first asks the LLM to generate a short paragraph that might answer the question. This generated text is then embedded and used as the vector search query.
This one feels very counterintuitive, especially when using small local models: how can a potential hallucination improve the final result? That was my first concern, but the hallucination is never shown to the user; it is only used as the search input to the vector database, so in the worst case there are simply no relevant matches.
Example query: How does warp drive work?
The local LLM will generate a small paragraph explaining warp drive, and that paragraph is then used to improve the semantic search. The idea is that richer text produces a richer embedding, which tends to match relevant documents better than a short question does.
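A sketch of HyDE on top of the same building blocks (the prompt and the `llm` callable are again placeholders):

```python
def hyde_retrieve(question: str, llm, collection, n_results: int = 5):
    # Generate a short hypothetical answer; it is only used as the search input,
    # so a hallucinated detail never reaches the user directly.
    hypothetical = llm(f"Write a short paragraph that plausibly answers: {question}")

    # Embed and search with the hypothetical answer instead of the raw question.
    results = collection.query(query_texts=[hypothetical], n_results=n_results)
    return results["documents"][0]
```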
Query Routing
Query routing directs different types of questions to specialized retrieval systems.
Instead of searching a single database, queries may be routed to different indexes.
For example:
Character questions → character database
Technology questions → technology articles
Episode questions → episode summaries
A query like:
Who is Jean-Luc Picard?
could be routed directly to a character index containing pages about Jean-Luc Picard. This approach improves retrieval precision and scalability for large knowledge bases.
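A minimal router can be a single classification call that picks one of several collections. The category names and index names below are assumptions for illustration:

```python
# Hypothetical mapping from question category to a dedicated collection.
ROUTES = {
    "character": "character_index",
    "technology": "technology_index",
    "episode": "episode_index",
}

def route_and_retrieve(question: str, llm, client, n_results: int = 5):
    # Ask the LLM to classify the question into one of the known categories.
    category = llm(
        "Classify this question as character, technology or episode. "
        f"Answer with a single word.\n\nQuestion: {question}"
    ).strip().lower()

    # Fall back to the general collection if the classification is unexpected.
    collection = client.get_collection(ROUTES.get(category, "memory_alpha"))
    results = collection.query(query_texts=[question], n_results=n_results)
    return results["documents"][0]
```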
Auto-Merging Retriever
When documents are chunked for embedding, important context can become fragmented.
An auto-merging retriever detects when multiple chunks come from the same source document and merges them back together before sending them to the LLM.
For example, a long article about the Dominion War might be split into several chunks.
If multiple chunks are retrieved, the retriever merges them to restore a coherent section of the original article.
This produces more complete and understandable context.
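A very simple sketch of the merging step, assuming each chunk's metadata records the source article `title` and a `chunk_index` position (the latter is an assumed field my current pipeline does not store):

```python
from collections import defaultdict

def auto_merge(results):
    # Group retrieved chunks by their source article ("title" metadata field),
    # keeping each chunk's original position ("chunk_index", an assumed field).
    by_source = defaultdict(list)
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        by_source[meta["title"]].append((meta.get("chunk_index", 0), doc))

    # Re-assemble each article's chunks in their original order.
    merged = []
    for title, chunks in by_source.items():
        chunks.sort(key=lambda chunk: chunk[0])
        merged.append("\n".join(text for _, text in chunks))
    return merged
```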
Sentence Window Retrieval
A sentence window retriever focuses on retrieving specific sentences and expanding the surrounding context.
Instead of retrieving an entire chunk, the system finds the most relevant sentence and retrieves a small window of sentences around it.
Example concept:
Genesis Device
If the key sentence describing the Genesis Device is retrieved, the retriever may include several surrounding sentences to ensure the LLM receives a complete explanation.
This approach improves factual grounding while maintaining precise retrieval.
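The expansion step itself is tiny. A sketch, assuming the sentences of each article are indexed individually and the position of the matched sentence is known:

```python
def expand_window(sentences, hit_index, window=2):
    # Return the matched sentence plus `window` sentences on each side.
    start = max(0, hit_index - window)
    end = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[start:end])

# Example: if sentence 14 of the Genesis Device article is the best match,
# the LLM receives sentences 12-16 instead of sentence 14 alone.
```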
Reranking
Vector similarity is not always the best indicator of relevance.
Reranking adds a second evaluation stage after the initial retrieval.
The process works like this:
- Retrieve the top 20 candidate documents using vector search
- Use a reranker model to score each document
- Return the top 3–5 most relevant documents
Because rerankers evaluate the query and document together, they often produce more accurate relevance scores.
For example:
What powers does Q have?
A reranker will prioritize documents specifically describing Q’s abilities rather than general mentions of the character.
In many RAG systems, reranking provides one of the largest improvements in retrieval quality.
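A sketch using a cross-encoder from the sentence-transformers library; the specific model name is just a commonly used example, not one I have benchmarked against this dataset:

```python
from sentence_transformers import CrossEncoder

# Cross-encoders score the query and each candidate document together.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question: str, documents: list[str], top_k: int = 3) -> list[str]:
    scores = reranker.predict([(question, doc) for doc in documents])
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]
```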
Query Decomposition
Some user questions are too complex to retrieve context with a single search.
Query decomposition breaks complex questions into smaller sub-queries.
Example question:
What caused the Dominion War and how did it end?
This might be decomposed into:
- What caused the Dominion War?
- What events occurred during the Dominion War?
- How did the war end?
Each sub-query retrieves relevant documents, and the LLM synthesizes the final answer.
This approach significantly improves responses to multi-part questions.
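A sketch of the decomposition step, again leaning on the local LLM (the prompt wording and helper names are illustrative):

```python
def decompose_and_retrieve(question: str, llm, collection, n_results: int = 3):
    # Ask the LLM to break the question into simpler sub-questions.
    prompt = (
        "Break this question into 2-4 simpler sub-questions, one per line:\n"
        f"{question}"
    )
    sub_questions = [q.strip() for q in llm(prompt).splitlines() if q.strip()]

    # Retrieve context for each sub-question separately; the final answer is
    # synthesized by the LLM from all of the collected context.
    context = {}
    for sub in sub_questions:
        results = collection.query(query_texts=[sub], n_results=n_results)
        context[sub] = results["documents"][0]
    return context
```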
Which Techniques Matter Most?
While all of these strategies can improve retrieval, some provide larger gains than others.
I did not try them all, so I asked ChatGPT to list the most relevant ones:
- Reranking – dramatically improves top results
- Multi-Query Retrieval – increases recall
- Query Decomposition – improves complex question handling
Together, these techniques can transform a basic RAG system into a much more robust knowledge assistant.
Conclusion
A simple RAG pipeline can work surprisingly well, but its limitations quickly become apparent when answering complex or ambiguous questions.
Improving the retriever is often the most impactful way to increase the quality of a RAG system. Techniques such as reranking, multi-query retrieval, and query decomposition can significantly improve the relevance and completeness of retrieved context.
For a knowledge base like the Memory Alpha wiki from the Star Trek universe, these techniques help ensure that the system retrieves the most relevant pieces of information before generating an answer.