Fine-Tuning vs Retrieval-Augmented Generation: A Small Experiment with Mistral-7B
Large language models have made it surprisingly easy to build systems that can answer technical questions. However, adapting these models to specialized domains—such as computer science interview questions—remains an open challenge.
Two common strategies are widely used today:
- Fine-tuning the model on domain-specific data
- Retrieval-Augmented Generation (RAG), where relevant context is retrieved and injected into the prompt before generation
Fine-tuning modifies the model’s internal parameters to better match the target domain. RAG, on the other hand, keeps the model unchanged but augments the input with retrieved knowledge.
This raises an interesting question:
Which approach works better for technical question answering?
To explore this, I conducted a small experiment using Mistral-7B-Instruct, comparing four configurations:
- Vanilla Mistral
- RAG + Vanilla
- LoRA Fine-Tuned
- RAG + Fine-Tuned
The results were not entirely what I expected.
Building a Technical QA Dataset
The first step was constructing a dataset of technical question-answer pairs covering core computer science topics such as:
- Data structures
- Algorithms
- Operating systems
- Databases
- Computer networks
I began with a small seed dataset of roughly 200 curated interview questions. These were drawn from technical interview resources and existing open datasets.
To scale the dataset, I used Qwen to generate additional question-answer pairs. The model was prompted to produce variations of the seed questions while maintaining technical accuracy and domain relevance.
This synthetic expansion increased the dataset size to roughly 2,070 samples.
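The post does not include the exact generation setup, but a minimal sketch of this kind of paraphrase-style expansion with a Qwen instruct model might look like the following. The checkpoint name, prompt wording, and sampling settings are all assumptions:

```python
from transformers import pipeline

# Assumed checkpoint; the post does not specify which Qwen variant was used.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")

def expand_question(seed_question: str, seed_answer: str, n_variants: int = 3) -> str:
    """Ask the model for rephrased variants of a seed QA pair (hypothetical prompt)."""
    prompt = (
        "You are generating technical interview questions.\n"
        f"Original question: {seed_question}\n"
        f"Original answer: {seed_answer}\n"
        f"Write {n_variants} rephrased versions of the question, each followed by a "
        "technically accurate answer. Keep the topic and difficulty the same."
    )
    output = generator(prompt, max_new_tokens=512, do_sample=True, temperature=0.7)
    return output[0]["generated_text"]
```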
However, automatically generated data often contains redundancy, so several preprocessing steps were applied.
Dataset Cleaning and Filtering
Two filtering stages were used to improve dataset quality.
First, exact duplicates were removed by comparing normalized question strings. This removed 51 duplicate entries, leaving 2,019 samples.
Next, I performed semantic deduplication using sentence embeddings generated by MiniLM-L6-v2. For each question pair, cosine similarity was computed, and samples with similarity greater than 0.9 were considered paraphrases. In such cases, one of the duplicates was removed.
This process removed 213 additional samples, resulting in a final dataset of 1,806 unique question-answer pairs.
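As a rough illustration, both filtering stages can be implemented with sentence-transformers (the `all-MiniLM-L6-v2` checkpoint) and cosine similarity. The snippet below is a simplified sketch; the exact normalization rules and data format are assumptions:

```python
from sentence_transformers import SentenceTransformer, util

def deduplicate(samples, threshold=0.9):
    """samples: list of {"question": ..., "answer": ...} dicts (assumed format)."""
    # Stage 1: drop exact duplicates on normalized question strings.
    seen, unique = set(), []
    for s in samples:
        key = " ".join(s["question"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(s)

    # Stage 2: semantic deduplication with MiniLM embeddings.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode([s["question"] for s in unique], convert_to_tensor=True)
    keep = []
    for i in range(len(unique)):
        # Keep a sample only if it is below the similarity threshold
        # against every sample already kept.
        if all(util.cos_sim(embeddings[i], embeddings[j]).item() < threshold for j in keep):
            keep.append(i)
    return [unique[i] for i in keep]
```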
The dataset was then split into:
- 70% training data (1,264 samples)
- 15% validation data (270 samples)
- 15% test data (272 samples)
The split used a fixed random seed to ensure reproducibility.
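A minimal sketch of such a split with scikit-learn, assuming the data sits in a pandas DataFrame; the seed value of 42 is an assumption, not taken from the post:

```python
from sklearn.model_selection import train_test_split

# `data` is assumed to be a pandas DataFrame of QA pairs.
train_df, temp_df = train_test_split(data, test_size=0.30, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.50, random_state=42)
```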
The final dataset is available on Kaggle.
Two Approaches to Domain Adaptation
With the dataset ready, I implemented two different strategies for adapting the model to technical question answering.
Approach 1: LoRA Fine-Tuning
The first approach used LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning technique.
Instead of updating all model weights, LoRA freezes the base model and inserts small trainable low-rank matrices alongside the attention projections. This dramatically reduces the number of trainable parameters while still allowing the model to adapt to new tasks.
The model was trained for three epochs on the training dataset using a learning rate of 1e-4. Because LoRA modifies only a small subset of parameters, the training process was relatively efficient even with a large base model.
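The post does not list the full hyperparameters, but a LoRA setup along these lines with the peft library would match the description. The epochs and learning rate come from the text; the rank, alpha, target modules, batch size, and exact Mistral checkpoint are assumptions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-Instruct-v0.2"  # exact instruct version is an assumption
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

# Small trainable low-rank adapters on the attention projections; base weights stay frozen.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="mistral-qa-lora",
    num_train_epochs=3,              # from the post
    learning_rate=1e-4,              # from the post
    per_device_train_batch_size=4,   # assumption
    logging_steps=10,
)

# `train_dataset` is the tokenized training split; tokenization and label
# preparation are omitted from this sketch.
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```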
The training loss curve showed steady convergence across the training steps.
Approach 2: Retrieval-Augmented Generation
The second approach was Retrieval-Augmented Generation (RAG).
Instead of modifying the model weights, RAG retrieves relevant context from a knowledge base and includes it in the prompt during generation.
The retrieval pipeline consisted of:
- Sentence embeddings generated using MiniLM-L6-v2
- A FAISS vector index for efficient similarity search
- Retrieval of the top 5 most relevant context passages
During inference, each question was processed through the following pipeline:
Question
→ Embed the query
→ Retrieve similar QA examples
→ Inject retrieved context into the prompt
→ Generate the answer using the LLM
In theory, this allows the model to access domain-specific knowledge without requiring retraining.
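A condensed sketch of this retrieval step is shown below, assuming the knowledge base is built from the training QA pairs; the index construction details and prompt template are assumptions:

```python
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Build the FAISS index over the knowledge base
# (`train_samples` is assumed to be the list of training QA dicts).
corpus = [f"Q: {s['question']}\nA: {s['answer']}" for s in train_samples]
corpus_emb = embedder.encode(corpus, normalize_embeddings=True)
index = faiss.IndexFlatIP(corpus_emb.shape[1])  # inner product == cosine on normalized vectors
index.add(corpus_emb)

def build_prompt(question: str, k: int = 5) -> str:
    """Retrieve the top-k similar QA examples and inject them into the prompt."""
    query_emb = embedder.encode([question], normalize_embeddings=True)
    _, idx = index.search(query_emb, k)
    context = "\n\n".join(corpus[i] for i in idx[0])
    return (
        "Use the following examples as context.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```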
Experimental Setup
To compare both strategies fairly, four model configurations were evaluated:
| Model | Description |
|---|---|
| Vanilla | Base Mistral-7B model |
| RAG + Vanilla | Retrieval-augmented inference |
| Fine-Tuned | LoRA fine-tuned model |
| RAG + Fine-Tuned | Retrieval combined with the fine-tuned model |
Evaluation was performed on the held-out test set using four metrics:
- BLEU-4 — measures n-gram overlap
- ROUGE-L — captures structural similarity
- BERTScore — measures semantic similarity using contextual embeddings
- Exact Match — checks whether the generated answer exactly matches the reference
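These metrics can be computed with the Hugging Face evaluate library; the sketch below assumes `predictions` and `references` are parallel lists of generated and gold answer strings:

```python
import evaluate

# predictions / references: lists of generated and reference answers (assumed inputs)
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

results = {
    "bleu_4": bleu.compute(predictions=predictions, references=references, max_order=4)["bleu"],
    "rouge_l": rouge.compute(predictions=predictions, references=references)["rougeL"],
    # BERTScore returns one F1 per sample; average them for a corpus-level score.
    "bertscore_f1": sum(
        bertscore.compute(predictions=predictions, references=references, lang="en")["f1"]
    ) / len(predictions),
    "exact_match": sum(p.strip() == r.strip() for p, r in zip(predictions, references)) / len(predictions),
}
```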
Results
The evaluation results are summarized below.
| Model | BLEU-4 | ROUGE-L | BERTScore |
|---|---|---|---|
| Vanilla | 0.027 | 0.213 | 0.929 |
| RAG + Vanilla | 0.051 | 0.298 | 0.890 |
| Fine-Tuned | 0.056 | 0.287 | 0.889 |
| RAG + Fine-Tuned | 0.038 | 0.252 | 0.871 |
At first glance, fine-tuning appears competitive because it achieves the highest BLEU-4 score. However, the semantic metric (BERTScore) tells a different story: both fine-tuned variants trail the vanilla baseline.
The RAG + Vanilla configuration achieved the highest ROUGE-L score while staying close to the baseline on BERTScore, and, as the qualitative example below shows, retrieval helped ground the model's answers in relevant context.
A Qualitative Example
Consider the question:
“Implement a function to check if a binary tree is balanced.”
The reference answer describes using a recursive function to compute subtree heights and check whether the height difference exceeds one.
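For reference, a minimal version of the recursive approach the reference answer describes (not the model's verbatim output) looks like this:

```python
class TreeNode:
    def __init__(self, val=0, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right

def is_balanced(root: TreeNode) -> bool:
    """Return True if, for every node, the subtree heights differ by at most one."""
    def height(node):
        # Returns the subtree height, or -1 as soon as an imbalance is found.
        if node is None:
            return 0
        left = height(node.left)
        right = height(node.right)
        if left == -1 or right == -1 or abs(left - right) > 1:
            return -1
        return 1 + max(left, right)

    return height(root) != -1
```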
The RAG + Fine-Tuned model produced an unrelated explanation about binary search trees and hash tables.
In contrast, RAG + Vanilla generated a correct recursive approach, describing how to compute subtree heights and verify balance conditions.
This pattern appeared multiple times in the evaluation results.
The Fine-Tuning Paradox
One of the most interesting findings was what I refer to as the fine-tuning paradox.
Fine-tuning improved certain lexical metrics, such as BLEU and ROUGE, but sometimes degraded semantic accuracy. In several cases, the fine-tuned model produced answers that were grammatically correct yet conceptually incorrect.
This behavior resembles catastrophic forgetting, where the model loses some of its general knowledge while adapting to a narrower dataset.
Because the fine-tuning dataset was relatively small, the model may have overfit to specific phrasing patterns rather than deeper conceptual understanding.
Why Retrieval Worked Better
Retrieval-augmented generation offers a different advantage: it does not modify the model’s internal knowledge.
Instead, it provides relevant context dynamically during inference.
This has several benefits:
- The base model retains its general reasoning ability
- Answers are grounded in retrieved domain knowledge
- The system can easily incorporate new data without retraining
For technical domains where precise definitions matter, this approach appears particularly effective.
Final Thoughts
This experiment suggests that retrieval-based approaches may be more reliable than aggressive fine-tuning for technical question answering.
While fine-tuning can improve surface-level metrics, retrieval provides a more robust mechanism for grounding model responses in relevant knowledge.
In practice, the RAG + Vanilla configuration offered the best balance of accuracy and reliability.
Resources
- GitHub Repository: https://github.com/AtulDeshpande09/rag-technical-qa
- HuggingFace Model (fine-tuned): https://huggingface.co/AtulDeshpande/mistral-interview-assistant
- Kaggle Dataset: https://www.kaggle.com/datasets/atuldeshpande96/technical-question-answering-dataset




