Thursday, March 5, 2026

Why RAG Beat Fine-Tuning for Technical Question Answering

Fine-Tuning vs Retrieval-Augmented Generation: A Small Experiment with Mistral-7B

🤗 Model
📊 Dataset
💻 Code

Large language models have made it surprisingly easy to build systems that can answer technical questions. However, adapting these models to specialized domains—such as computer science interview questions—remains an open challenge.

Two common strategies are widely used today:

Fine-tuning modifies the model’s internal parameters to better match the target domain. RAG, on the other hand, keeps the model unchanged but augments the input with retrieved knowledge.

This raises an interesting question:

Which approach works better for technical question answering?

To explore this, I conducted a small experiment using Mistral-7B-Instruct, comparing four configurations:

  1. Vanilla Mistral

  2. RAG + Vanilla

  3. LoRA Fine-Tuned

  4. RAG + Fine-Tuned

The results were not entirely what I expected.


Building a Technical QA Dataset

The first step was constructing a dataset of technical question-answer pairs covering core computer science topics such as:

  • Data structures

  • Algorithms

  • Operating systems

  • Databases

  • Computer networks

I began with a small seed dataset of roughly 200 curated interview questions. These were drawn from technical interview resources and existing open datasets.

To scale the dataset, I used Qwen to generate additional question-answer pairs. The model was prompted to produce variations of the seed questions while maintaining technical accuracy and domain relevance.

This synthetic expansion increased the dataset size to roughly 2,070 samples.
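
A minimal sketch of this generation step is shown below; the specific Qwen variant and prompt wording are assumptions on my part, not the exact ones used.

```python
# Sketch: expand the seed set with a Qwen instruct model.
# Model ID and prompt wording are illustrative assumptions.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-7B-Instruct",  # assumed Qwen variant
    device_map="auto",
)

def make_variations(seed_question, seed_answer, n=3):
    prompt = (
        f"Write {n} new interview question-answer pairs that are "
        "variations of the pair below. Keep them technically accurate "
        "and in the same domain.\n\n"
        f"Q: {seed_question}\nA: {seed_answer}"
    )
    out = generator(prompt, max_new_tokens=512, do_sample=True,
                    temperature=0.7, return_full_text=False)
    return out[0]["generated_text"]  # raw text, parsed into QA pairs downstream
```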

However, automatically generated data often contains redundancy, so several preprocessing steps were applied.


Dataset Cleaning and Filtering

Two filtering stages were used to improve dataset quality.

First, exact duplicates were removed by comparing normalized question strings. This removed 51 duplicate entries, leaving 2,019 samples.

Next, I performed semantic deduplication using sentence embeddings generated by MiniLM-L6-v2. For each question pair, cosine similarity was computed, and samples with similarity greater than 0.9 were considered paraphrases. In such cases, one of the duplicates was removed.

This process removed 213 additional samples, resulting in a final dataset of 1,806 unique question-answer pairs.
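
Both filtering stages fit in a short script. The sketch below assumes the data is a list of question-answer dicts; the 0.9 threshold matches the one described above.

```python
# Sketch of both filtering stages: exact dedup on normalized strings,
# then semantic dedup with MiniLM embeddings (cosine similarity > 0.9).
from sentence_transformers import SentenceTransformer, util

def normalize(q):
    return " ".join(q.lower().split())

def deduplicate(samples):
    # Stage 1: remove exact duplicates of the normalized question string.
    seen, unique = set(), []
    for s in samples:
        key = normalize(s["question"])
        if key not in seen:
            seen.add(key)
            unique.append(s)

    # Stage 2: remove near-paraphrases via pairwise cosine similarity.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode([s["question"] for s in unique],
                       normalize_embeddings=True)
    sim = util.cos_sim(emb, emb)  # pairwise cosine similarity matrix

    kept = []
    for i in range(len(unique)):
        # Keep a sample only if it is not a paraphrase of one already kept.
        if all(sim[i][j] <= 0.9 for j in kept):
            kept.append(i)
    return [unique[i] for i in kept]
```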

The dataset was then split into:

  • 70% training data (1264 samples)

  • 15% validation data (270 samples)

  • 15% test data (272 samples)

The split used a fixed random seed to ensure reproducibility.
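
A two-step split reproduces these proportions; the seed value shown here is illustrative, since only the fact that it was fixed is stated.

```python
# Sketch: 70/15/15 split with a fixed random seed for reproducibility.
# samples: the 1,806 deduplicated QA pairs; the seed value is an assumption.
from sklearn.model_selection import train_test_split

train, holdout = train_test_split(samples, test_size=0.30, random_state=42)
val, test = train_test_split(holdout, test_size=0.50, random_state=42)
# -> roughly 1264 / 270 / 272 samples, matching the counts above
```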

The final dataset is available on Kaggle.


Two Approaches to Domain Adaptation

With the dataset ready, I implemented two different strategies for adapting the model to technical question answering.

Approach 1: LoRA Fine-Tuning

The first approach used LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning technique.

Instead of updating all model weights, LoRA inserts small trainable matrices into the attention layers. This dramatically reduces the number of parameters that need to be trained while still allowing the model to adapt to new tasks.

The model was trained for three epochs on the training dataset using a learning rate of 1e-4. Because LoRA modifies only a small subset of parameters, the training process was relatively efficient even with a large base model.

The training loss curve showed steady convergence across the training steps.
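
For reference, here is a minimal peft sketch of this setup. The learning rate and epoch count match the run described above; the rank, alpha, target modules, and base checkpoint version are assumptions.

```python
# Sketch: LoRA fine-tuning of Mistral-7B-Instruct with peft + transformers.
# lr=1e-4 and 3 epochs come from the run above; r, alpha, and the
# target modules are illustrative defaults.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed checkpoint version
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.bfloat16, device_map="auto")

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # a tiny fraction of the 7B weights

args = TrainingArguments(
    output_dir="mistral-qa-lora",
    num_train_epochs=3,
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    logging_steps=10,
)
# tokenized_train: the 1,264 training pairs, tokenized with labels set.
trainer = Trainer(model=model, args=args, train_dataset=tokenized_train)
trainer.train()
```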


Approach 2: Retrieval-Augmented Generation

The second approach was Retrieval-Augmented Generation (RAG).

Instead of modifying the model weights, RAG retrieves relevant context from a knowledge base and includes it in the prompt during generation.

The retrieval pipeline consisted of:

  • Sentence embeddings generated using MiniLM-L6-v2

  • A FAISS vector index for efficient similarity search

  • Retrieval of the top 5 most relevant context passages

During inference, each question was processed through the following pipeline:

Question
→ Embed the query
→ Retrieve similar QA examples
→ Inject retrieved context into the prompt
→ Generate the answer using the LLM

In theory, this allows the model to access domain-specific knowledge without requiring retraining.
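
The entire retrieval side fits in a short sketch; the prompt template shown here is illustrative.

```python
# Sketch: MiniLM embeddings + FAISS index + top-5 retrieval, as described above.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Build the index over the knowledge base (here: the training QA pairs).
passages = [f"Q: {s['question']}\nA: {s['answer']}" for s in train]
emb = embedder.encode(passages, normalize_embeddings=True)
index = faiss.IndexFlatIP(emb.shape[1])  # inner product == cosine on unit vectors
index.add(np.asarray(emb, dtype="float32"))

def build_prompt(question, k=5):
    q = embedder.encode([question], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    context = "\n\n".join(passages[i] for i in ids[0])
    return (f"Use the following examples as context.\n\n{context}\n\n"
            f"Question: {question}\nAnswer:")
```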


Experimental Setup

To compare both strategies fairly, four model configurations were evaluated:

Model            | Description
-----------------|----------------------------------------------
Vanilla          | Base Mistral-7B model
RAG + Vanilla    | Retrieval-augmented inference
Fine-Tuned       | LoRA fine-tuned model
RAG + Fine-Tuned | Retrieval combined with the fine-tuned model

Evaluation was performed on the held-out test set using four metrics, computed as sketched after the list:

  • BLEU-4 — measures n-gram overlap

  • ROUGE-L — captures structural similarity

  • BERTScore — measures semantic similarity using contextual embeddings

  • Exact Match — checks whether the generated answer exactly matches the reference
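
A scoring sketch using the Hugging Face evaluate library; the aggregation details are assumptions on my part.

```python
# Sketch: computing the four metrics over the test-set predictions.
import evaluate
import numpy as np

bleu = evaluate.load("bleu")        # uses max_order=4, i.e. BLEU-4
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

def score(predictions, references):
    return {
        "bleu4": bleu.compute(predictions=predictions,
                              references=references)["bleu"],
        "rougeL": rouge.compute(predictions=predictions,
                                references=references)["rougeL"],
        "bertscore_f1": float(np.mean(bertscore.compute(
            predictions=predictions, references=references,
            lang="en")["f1"])),
        "exact_match": sum(p.strip() == r.strip()
                           for p, r in zip(predictions, references))
                       / len(predictions),
    }
```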


Results

The evaluation results are summarized below.

Model            | BLEU-4 | ROUGE-L | BERTScore
-----------------|--------|---------|----------
Vanilla          | 0.027  | 0.213   | 0.929
RAG + Vanilla    | 0.051  | 0.298   | 0.890
Fine-Tuned       | 0.056  | 0.287   | 0.889
RAG + Fine-Tuned | 0.038  | 0.252   | 0.871

At first glance, fine-tuning appears competitive because it achieves the highest BLEU-4 score. However, a closer look at the semantic metric (BERTScore) reveals a different story: the vanilla model retains the highest semantic similarity, and every adaptation strategy gives some of it up in exchange for better lexical overlap.


Among the adapted configurations, RAG + Vanilla struck the best balance. It achieved the highest ROUGE-L overall and the smallest drop in BERTScore, suggesting that retrieval helped ground the model's answers in relevant context without disturbing its general knowledge.


A Qualitative Example

Consider the question:

“Implement a function to check if a binary tree is balanced.”

The reference answer describes using a recursive function to compute subtree heights and check whether the height difference exceeds one.

The RAG + Fine-Tuned model produced an unrelated explanation about binary search trees and hash tables.

In contrast, RAG + Vanilla generated a correct recursive approach, describing how to compute subtree heights and verify balance conditions.

This pattern appeared multiple times in the evaluation results.


The Fine-Tuning Paradox

One of the most interesting findings was what I refer to as the fine-tuning paradox.

Fine-tuning improved certain lexical metrics, such as BLEU and ROUGE, but sometimes degraded semantic accuracy. In several cases, the fine-tuned model produced answers that were grammatically correct yet conceptually incorrect.

This behavior resembles catastrophic forgetting, where the model loses some of its general knowledge while adapting to a narrower dataset.

Because the fine-tuning dataset was relatively small, the model may have overfit to specific phrasing patterns rather than deeper conceptual understanding.

Why Retrieval Worked Better

Retrieval-augmented generation offers a different advantage: it does not modify the model’s internal knowledge.

Instead, it provides relevant context dynamically during inference.

This has several benefits:

  • The base model retains its general reasoning ability

  • Answers are grounded in retrieved domain knowledge

  • The system can easily incorporate new data without retraining

For technical domains where precise definitions matter, this approach appears particularly effective.


Final Thoughts

This experiment suggests that retrieval-based approaches may be more reliable than aggressive fine-tuning for technical question answering.

While fine-tuning can improve surface-level metrics, retrieval provides a more robust mechanism for grounding model responses in relevant knowledge.

In practice, the RAG + Vanilla configuration offered the best balance of accuracy and reliability.


Resources

GitHub Repository : https://github.com/AtulDeshpande09/rag-technical-qa
HuggingFace Model (fine-tuned) : https://huggingface.co/AtulDeshpande/mistral-interview-assistant
Kaggle Dataset : https://www.kaggle.com/datasets/atuldeshpande96/technical-question-answering-dataset
