Fine-Tuning vs Retrieval-Augmented Generation: A Small Experiment with Mistral-7B
Large language models have made it surprisingly easy to build systems that can answer technical questions. However, adapting these models to specialized domains—such as computer science interview questions—remains an open challenge.
Two common strategies are widely used today:
- Fine-tuning the model on domain-specific data
- Retrieval-Augmented Generation (RAG), where relevant context is retrieved and injected into the prompt before generation
Fine-tuning modifies the model’s internal parameters to better match the target domain. RAG, on the other hand, keeps the model unchanged but augments the input with retrieved knowledge.
This raises an interesting question:
Which approach works better for technical question answering?
To explore this, I conducted a small experiment using Mistral-7B-Instruct, comparing four configurations:
- Vanilla Mistral
- RAG + Vanilla
- LoRA Fine-Tuned
- RAG + Fine-Tuned
The results were not entirely what I expected.
Building a Technical QA Dataset
The first step was constructing a dataset of technical question-answer pairs covering core computer science topics such as:
- Data structures
- Algorithms
- Operating systems
- Databases
- Computer networks
I began with a small seed dataset of roughly 200 curated interview questions. These were drawn from technical interview resources and existing open datasets.
To scale the dataset, I used Qwen to generate additional question-answer pairs. The model was prompted to produce variations of the seed questions while maintaining technical accuracy and domain relevance.
This synthetic expansion increased the dataset size to roughly 2,070 samples.
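The post does not include the exact generation setup, but a minimal sketch of this kind of paraphrase-style expansion with a Qwen instruct model might look like the following. The checkpoint name, prompt wording, and sampling settings are all assumptions:

```python
from transformers import pipeline

# Assumed checkpoint; the post does not specify which Qwen variant was used.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")

def expand_question(seed_question: str, seed_answer: str, n_variants: int = 3) -> str:
    """Ask the model for rephrased variants of a seed QA pair (hypothetical prompt)."""
    prompt = (
        "You are generating technical interview questions.\n"
        f"Original question: {seed_question}\n"
        f"Original answer: {seed_answer}\n"
        f"Write {n_variants} rephrased versions of the question, each followed by a "
        "technically accurate answer. Keep the topic and difficulty the same."
    )
    output = generator(prompt, max_new_tokens=512, do_sample=True, temperature=0.7)
    return output[0]["generated_text"]
```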
However, automatically generated data often contains redundancy, so several preprocessing steps were applied.
Dataset Cleaning and Filtering
Two filtering stages were used to improve dataset quality.
First, exact duplicates were removed by comparing normalized question strings. This removed 51 duplicate entries, leaving 2,019 samples.
Next, I performed semantic deduplication using sentence embeddings generated by MiniLM-L6-v2. For each question pair, cosine similarity was computed, and samples with similarity greater than 0.9 were considered paraphrases. In such cases, one of the duplicates was removed.
This process removed 213 additional samples, resulting in a final dataset of 1,806 unique question-answer pairs.
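As a rough illustration, both filtering stages can be implemented with sentence-transformers (the `all-MiniLM-L6-v2` checkpoint) and cosine similarity. The snippet below is a simplified sketch; the exact normalization rules and data format are assumptions:

```python
from sentence_transformers import SentenceTransformer, util

def deduplicate(samples, threshold=0.9):
    """samples: list of {"question": ..., "answer": ...} dicts (assumed format)."""
    # Stage 1: drop exact duplicates on normalized question strings.
    seen, unique = set(), []
    for s in samples:
        key = " ".join(s["question"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(s)

    # Stage 2: semantic deduplication with MiniLM embeddings.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode([s["question"] for s in unique], convert_to_tensor=True)
    keep = []
    for i in range(len(unique)):
        # Keep a sample only if it is below the similarity threshold
        # against every sample already kept.
        if all(util.cos_sim(embeddings[i], embeddings[j]).item() < threshold for j in keep):
            keep.append(i)
    return [unique[i] for i in keep]
```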
The dataset was then split into:
- 70% training data (1,264 samples)
- 15% validation data (270 samples)
- 15% test data (272 samples)
The split used a fixed random seed to ensure reproducibility.
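A minimal sketch of such a split with scikit-learn, assuming the data sits in a pandas DataFrame; the seed value of 42 is an assumption, not taken from the post:

```python
from sklearn.model_selection import train_test_split

# `data` is assumed to be a pandas DataFrame of QA pairs.
train_df, temp_df = train_test_split(data, test_size=0.30, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.50, random_state=42)
```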
The final dataset is available on Kaggle.
Two Approaches to Domain Adaptation
With the dataset ready, I implemented two different strategies for adapting the model to technical question answering.
Approach 1: LoRA Fine-Tuning
The first approach used LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning technique.
Instead of updating all model weights, LoRA freezes the base model and inserts small trainable low-rank matrices alongside the attention projections. This dramatically reduces the number of trainable parameters while still allowing the model to adapt to new tasks.
The model was trained for three epochs on the training dataset using a learning rate of 1e-4. Because LoRA modifies only a small subset of parameters, the training process was relatively efficient even with a large base model.
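The post does not list the full hyperparameters, but a LoRA setup along these lines with the peft library would match the description. The epochs and learning rate come from the text; the rank, alpha, target modules, batch size, and exact Mistral checkpoint are assumptions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-Instruct-v0.2"  # exact instruct version is an assumption
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

# Small trainable low-rank adapters on the attention projections; base weights stay frozen.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="mistral-qa-lora",
    num_train_epochs=3,              # from the post
    learning_rate=1e-4,              # from the post
    per_device_train_batch_size=4,   # assumption
    logging_steps=10,
)

# `train_dataset` is the tokenized training split; tokenization and label
# preparation are omitted from this sketch.
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```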
The training loss curve showed steady convergence across the training steps.
Approach 2: Retrieval-Augmented Generation
The second approach was Retrieval-Augmented Generation (RAG).
Instead of modifying the model weights, RAG retrieves relevant context from a knowledge base and includes it in the prompt during generation.
The retrieval pipeline consisted of:
- Sentence embeddings generated using MiniLM-L6-v2
- A FAISS vector index for efficient similarity search
- Retrieval of the top 5 most relevant context passages
During inference, each question was processed through the following pipeline:
Question
→ Embed the query
→ Retrieve similar QA examples
→ Inject retrieved context into the prompt
→ Generate the answer using the LLM
In theory, this allows the model to access domain-specific knowledge without requiring retraining.
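A condensed sketch of this retrieval step is shown below, assuming the knowledge base is built from the training QA pairs; the index construction details and prompt template are assumptions:

```python
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Build the FAISS index over the knowledge base
# (`train_samples` is assumed to be the list of training QA dicts).
corpus = [f"Q: {s['question']}\nA: {s['answer']}" for s in train_samples]
corpus_emb = embedder.encode(corpus, normalize_embeddings=True)
index = faiss.IndexFlatIP(corpus_emb.shape[1])  # inner product == cosine on normalized vectors
index.add(corpus_emb)

def build_prompt(question: str, k: int = 5) -> str:
    """Retrieve the top-k similar QA examples and inject them into the prompt."""
    query_emb = embedder.encode([question], normalize_embeddings=True)
    _, idx = index.search(query_emb, k)
    context = "\n\n".join(corpus[i] for i in idx[0])
    return (
        "Use the following examples as context.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```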
Experimental Setup
To compare both strategies fairly, four model configurations were evaluated:
| Model | Description |
|---|---|
| Vanilla | Base Mistral-7B model |
| RAG + Vanilla | Retrieval-augmented inference |
| Fine-Tuned | LoRA fine-tuned model |
| RAG + Fine-Tuned | Retrieval combined with the fine-tuned model |
Evaluation was performed on the held-out test set using four metrics:
- BLEU-4 — measures n-gram overlap
- ROUGE-L — captures structural similarity
- BERTScore — measures semantic similarity using contextual embeddings
- Exact Match — checks whether the generated answer exactly matches the reference
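These metrics can be computed with the Hugging Face evaluate library; the sketch below assumes `predictions` and `references` are parallel lists of generated and gold answer strings:

```python
import evaluate

# predictions / references: lists of generated and reference answers (assumed inputs)
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

results = {
    "bleu_4": bleu.compute(predictions=predictions, references=references, max_order=4)["bleu"],
    "rouge_l": rouge.compute(predictions=predictions, references=references)["rougeL"],
    # BERTScore returns one F1 per sample; average them for a corpus-level score.
    "bertscore_f1": sum(
        bertscore.compute(predictions=predictions, references=references, lang="en")["f1"]
    ) / len(predictions),
    "exact_match": sum(p.strip() == r.strip() for p, r in zip(predictions, references)) / len(predictions),
}
```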
Results
The evaluation results are summarized below.
| Model | BLEU-4 | ROUGE-L | BERTScore |
|---|---|---|---|
| Vanilla | 0.027 | 0.213 | 0.929 |
| RAG + Vanilla | 0.051 | 0.298 | 0.890 |
| Fine-Tuned | 0.056 | 0.287 | 0.889 |
| RAG + Fine-Tuned | 0.038 | 0.252 | 0.871 |
At first glance, fine-tuning appears competitive because it achieves the highest BLEU-4 score. However, the semantic metric (BERTScore) tells a different story: both fine-tuned variants trail the vanilla baseline.
The RAG + Vanilla configuration achieved the highest ROUGE-L score while staying close to the baseline on BERTScore, and, as the qualitative example below shows, retrieval helped ground the model's answers in relevant context.
A Qualitative Example
Consider the question:
“Implement a function to check if a binary tree is balanced.”
The reference answer describes using a recursive function to compute subtree heights and check whether the height difference exceeds one.
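For reference, a minimal version of the recursive approach the reference answer describes (not the model's verbatim output) looks like this:

```python
class TreeNode:
    def __init__(self, val=0, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right

def is_balanced(root: TreeNode) -> bool:
    """Return True if, for every node, the subtree heights differ by at most one."""
    def height(node):
        # Returns the subtree height, or -1 as soon as an imbalance is found.
        if node is None:
            return 0
        left = height(node.left)
        right = height(node.right)
        if left == -1 or right == -1 or abs(left - right) > 1:
            return -1
        return 1 + max(left, right)

    return height(root) != -1
```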
The RAG + Fine-Tuned model produced an unrelated explanation about binary search trees and hash tables.
In contrast, RAG + Vanilla generated a correct recursive approach, describing how to compute subtree heights and verify balance conditions.
This pattern appeared multiple times in the evaluation results.
The Fine-Tuning Paradox
One of the most interesting findings was what I refer to as the fine-tuning paradox.
Fine-tuning improved certain lexical metrics, such as BLEU and ROUGE, but sometimes degraded semantic accuracy. In several cases, the fine-tuned model produced answers that were grammatically correct yet conceptually incorrect.
This behavior resembles catastrophic forgetting, where the model loses some of its general knowledge while adapting to a narrower dataset.
Because the fine-tuning dataset was relatively small, the model may have overfit to specific phrasing patterns rather than deeper conceptual understanding.
Why Retrieval Worked Better
Retrieval-augmented generation offers a different advantage: it does not modify the model’s internal knowledge.
Instead, it provides relevant context dynamically during inference.
This has several benefits:
- The base model retains its general reasoning ability
- Answers are grounded in retrieved domain knowledge
- The system can easily incorporate new data without retraining
For technical domains where precise definitions matter, this approach appears particularly effective.
Final Thoughts
This experiment suggests that retrieval-based approaches may be more reliable than aggressive fine-tuning for technical question answering.
While fine-tuning can improve surface-level metrics, retrieval provides a more robust mechanism for grounding model responses in relevant knowledge.
In practice, the RAG + Vanilla configuration offered the best balance of accuracy and reliability.
Resources
- GitHub Repository: https://github.com/AtulDeshpande09/rag-technical-qa
- HuggingFace Model (fine-tuned): https://huggingface.co/AtulDeshpande/mistral-interview-assistant
- Kaggle Dataset: https://www.kaggle.com/datasets/atuldeshpande96/technical-question-answering-dataset




