In a previous post, I built a basic Retrieval Augmented Generation (RAG) pipeline using the Star Trek Memory Alpha wiki as a knowledge base. The system worked: a question came in, relevant chunks were retrieved from a vector database, and a local LLM generated an answer grounded in that retrieved context.
However, the quality of the answers was inconsistent.
Sometimes the retriever surfaced good-quality information and the model assembled excellent answers. Other times it returned unrelated text chunks that caused the model to generate incomplete or inaccurate responses.

This highlighted an important lesson when building RAG systems:
Answer quality is limited by retrieval quality.
Improving the retriever stage is an effective way to increase the accuracy and relevance of a RAG system.
In this post, I will explore some advanced retrieval techniques (collected from research and this book) and how they can improve a RAG system built on top of the Memory Alpha dataset from the Star Trek universe.
Why Retrieval Matters in RAG
A RAG pipeline typically looks like this: the user's question is embedded, the most similar chunks are retrieved from a vector database, and the LLM generates an answer using those chunks as context.

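In code, that loop is only a few lines. Here is a minimal sketch, assuming a Chroma collection that already holds the embedded wiki chunks and an `llm` callable wrapping the local model (both names are placeholders, not the exact code from the previous post):

```python
import chromadb

# Placeholder names: a persisted Chroma collection with the embedded wiki chunks.
client = chromadb.PersistentClient(path="memory_alpha_db")
collection = client.get_collection("memory_alpha")

def answer(question: str, llm) -> str:
    # Retrieve the chunks most similar to the question.
    results = collection.query(query_texts=[question], n_results=5)
    context = "\n\n".join(results["documents"][0])

    # Ask the local model to answer using only the retrieved context.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)
```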
The LLM itself does not inherently know the answer. Instead, it depends on the retrieved documents to supply the relevant context. For enterprise data this is obvious, but for more general topics, like Star Trek, even a local model may already have some knowledge of its own.
If the retriever returns the wrong documents, the model has no chance of producing a good answer.
For example, if a user asks: What powers does Q have?* and the retriever returns generic or unrelated data about Q that does not explicitly answer the question, the model will struggle to answer accurately.
*Yeah this requires some Star Trek knowledge to understand the question 🙂
Advanced Retrieval Techniques
Below are several techniques that can significantly improve the quality of retrieved context.
Metadata Filtering
Metadata filtering narrows the search space by applying constraints to the vector query.
When documents are stored in a vector database, they can include metadata such as: article title, series, year, etc.
Instead of searching the entire database, we can restrict queries to documents matching certain metadata attributes.
Example Query: What happened to the Enterprise in The Next Generation?
Metadata filter: series = “The Next Generation”
This prevents the system from retrieving documents related to other Star Trek series.
This is all great on paper, but my RAG only has one metadata field, Title, so while it can still be useful, it is not going to be a massive improvement, and it is something I already had from the earlier implementation.
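For completeness, here is what a filtered query looks like as a sketch, reusing the `collection` object from the earlier snippet. The `series` field is hypothetical; my own chunks only carry `title`:

```python
# Same query as before, but restricted to chunks whose metadata matches the filter.
# NOTE: the "series" field is assumed for illustration; my collection only stores "title".
results = collection.query(
    query_texts=["What happened to the Enterprise?"],
    n_results=5,
    where={"series": "The Next Generation"},
)
```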
Multi-Query Retrieval
Vector search depends heavily on wording, so even a small variation in phrasing can produce different embeddings and, therefore, different results.
Multi-query retrieval mitigates this by generating several alternative phrasings of the same question.
Example: Who created the Borg?
Alternative queries might include:
- What is the origin of the Borg?
- How were the Borg created?
- What species founded the Borg collective?
Each query performs a separate search, and the resulting documents/chunks are merged.
This technique helps ensure that important documents are not missed just because the wording differs. On the downside, it adds latency because of the extra LLM call needed to generate those alternative queries.
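A rough sketch of the idea, with the prompt wording and the `llm` callable as placeholder assumptions:

```python
def multi_query_retrieve(question: str, llm, collection, n_results: int = 5):
    # Ask the LLM for alternative phrasings of the same question.
    prompt = (
        "Rewrite the following question in 3 different ways, one per line:\n"
        f"{question}"
    )
    alternatives = [q.strip() for q in llm(prompt).splitlines() if q.strip()]

    # Run one vector search per phrasing (plus the original) and merge by chunk id.
    merged = {}
    for query in [question] + alternatives:
        results = collection.query(query_texts=[query], n_results=n_results)
        for doc_id, doc in zip(results["ids"][0], results["documents"][0]):
            merged[doc_id] = doc
    return list(merged.values())
```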
HyDE (Hypothetical Document Embeddings)
HyDE improves retrieval by generating a hypothetical answer before performing the search.
Instead of embedding the original question, the system first asks the LLM to generate a short paragraph that might answer the question. This generated text is then embedded and used as the vector search query.
This one feels very counterintuitive, especially when using small local models: how can a potential hallucination improve the final result? That was my first concern, but the hallucination is never shown to the user; it is only used as the search input to the vector database, so in the worst case there are simply no relevant matches.
Example query: How does warp drive work?
The local LLM will generate a small paragraph explaining warp drive, and that paragraph is then used to improve the semantic search. The idea is that richer text produces a richer embedding, which tends to match relevant documents better than a short question does.
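A sketch of HyDE on top of the same building blocks (the prompt and the `llm` callable are again placeholders):

```python
def hyde_retrieve(question: str, llm, collection, n_results: int = 5):
    # Generate a short hypothetical answer; it is only used as the search input,
    # so a hallucinated detail never reaches the user directly.
    hypothetical = llm(f"Write a short paragraph that plausibly answers: {question}")

    # Embed and search with the hypothetical answer instead of the raw question.
    results = collection.query(query_texts=[hypothetical], n_results=n_results)
    return results["documents"][0]
```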
Query Routing
Query routing directs different types of questions to specialized retrieval systems.
Instead of searching a single database, queries may be routed to different indexes.
For example:
Character questions → character database
Technology questions → technology articles
Episode questions → episode summaries
A query like:
Who is Jean-Luc Picard?
could be routed directly to a character index containing pages about Jean-Luc Picard. This approach improves retrieval precision and scalability for large knowledge bases.
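A minimal router can be a single classification call that picks one of several collections. The category names and index names below are assumptions for illustration:

```python
# Hypothetical mapping from question category to a dedicated collection.
ROUTES = {
    "character": "character_index",
    "technology": "technology_index",
    "episode": "episode_index",
}

def route_and_retrieve(question: str, llm, client, n_results: int = 5):
    # Ask the LLM to classify the question into one of the known categories.
    category = llm(
        "Classify this question as character, technology or episode. "
        f"Answer with a single word.\n\nQuestion: {question}"
    ).strip().lower()

    # Fall back to the general collection if the classification is unexpected.
    collection = client.get_collection(ROUTES.get(category, "memory_alpha"))
    results = collection.query(query_texts=[question], n_results=n_results)
    return results["documents"][0]
```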
Auto-Merging Retriever
When documents are chunked for embedding, important context can become fragmented.
An auto-merging retriever detects when multiple chunks come from the same source document and merges them back together before sending them to the LLM.
For example, a long article about the Dominion War might be split into several chunks.
If multiple chunks are retrieved, the retriever merges them to restore a coherent section of the original article.
This produces more complete and understandable context.
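A very simple sketch of the merging step, assuming each chunk's metadata records the source article `title` and a `chunk_index` position (the latter is an assumed field my current pipeline does not store):

```python
from collections import defaultdict

def auto_merge(results):
    # Group retrieved chunks by their source article ("title" metadata field),
    # keeping each chunk's original position ("chunk_index", an assumed field).
    by_source = defaultdict(list)
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        by_source[meta["title"]].append((meta.get("chunk_index", 0), doc))

    # Re-assemble each article's chunks in their original order.
    merged = []
    for title, chunks in by_source.items():
        chunks.sort(key=lambda chunk: chunk[0])
        merged.append("\n".join(text for _, text in chunks))
    return merged
```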
Sentence Window Retrieval
A sentence window retriever focuses on retrieving specific sentences and expanding the surrounding context.
Instead of retrieving an entire chunk, the system finds the most relevant sentence and retrieves a small window of sentences around it.
Example concept:
Genesis Device
If the key sentence describing the Genesis Device is retrieved, the retriever may include several surrounding sentences to ensure the LLM receives a complete explanation.
This approach improves factual grounding while maintaining precise retrieval.
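The expansion step itself is tiny. A sketch, assuming the sentences of each article are indexed individually and the position of the matched sentence is known:

```python
def expand_window(sentences, hit_index, window=2):
    # Return the matched sentence plus `window` sentences on each side.
    start = max(0, hit_index - window)
    end = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[start:end])

# Example: if sentence 14 of the Genesis Device article is the best match,
# the LLM receives sentences 12-16 instead of sentence 14 alone.
```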
Reranking
Vector similarity is not always the best indicator of relevance.
Reranking adds a second evaluation stage after the initial retrieval.
The process works like this:
- Retrieve the top 20 candidate documents using vector search
- Use a reranker model to score each document
- Return the top 3–5 most relevant documents
Because rerankers evaluate the query and document together, they often produce more accurate relevance scores.
For example:
What powers does Q have?
A reranker will prioritize documents specifically describing Q’s abilities rather than general mentions of the character.
In many RAG systems, reranking provides one of the largest improvements in retrieval quality.
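A sketch using a cross-encoder from the sentence-transformers library; the specific model name is just a commonly used example, not one I have benchmarked against this dataset:

```python
from sentence_transformers import CrossEncoder

# Cross-encoders score the query and each candidate document together.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question: str, documents: list[str], top_k: int = 3) -> list[str]:
    scores = reranker.predict([(question, doc) for doc in documents])
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]
```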
Query Decomposition
Some user questions are too complex to retrieve context with a single search.
Query decomposition breaks complex questions into smaller sub-queries.
Example question:
What caused the Dominion War and how did it end?
This might be decomposed into:
- What caused the Dominion War?
- What events occurred during the Dominion War?
- How did the war end?
Each sub-query retrieves relevant documents, and the LLM synthesizes the final answer.
This approach significantly improves responses to multi-part questions.
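A sketch of the decomposition step, again leaning on the local LLM (the prompt wording and helper names are illustrative):

```python
def decompose_and_retrieve(question: str, llm, collection, n_results: int = 3):
    # Ask the LLM to break the question into simpler sub-questions.
    prompt = (
        "Break this question into 2-4 simpler sub-questions, one per line:\n"
        f"{question}"
    )
    sub_questions = [q.strip() for q in llm(prompt).splitlines() if q.strip()]

    # Retrieve context for each sub-question separately; the final answer is
    # synthesized by the LLM from all of the collected context.
    context = {}
    for sub in sub_questions:
        results = collection.query(query_texts=[sub], n_results=n_results)
        context[sub] = results["documents"][0]
    return context
```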
Which Techniques Matter Most?
While all of these strategies can improve retrieval, some provide larger gains than others.
I did not try them all, so I asked ChatGPT to list the most relevant ones:
- Reranking – dramatically improves top results
- Multi-Query Retrieval – increases recall
- Query Decomposition – improves complex question handling
Together, these techniques can transform a basic RAG system into a much more robust knowledge assistant.
Conclusion
A simple RAG pipeline can work surprisingly well, but its limitations quickly become apparent when answering complex or ambiguous questions.
Improving the retriever is often the most impactful way to increase the quality of a RAG system. Techniques such as reranking, multi-query retrieval, and query decomposition can significantly improve the relevance and completeness of retrieved context.
For a knowledge base like the Memory Alpha wiki from the Star Trek universe, these techniques help ensure that the system retrieves the most relevant pieces of information before generating an answer.