Thursday, March 5, 2026

Why RAG Beat Fine-Tuning for Technical Question Answering

Fine-Tuning vs Retrieval-Augmented Generation: A Small Experiment with Mistral-7B

🤗 Model
📊 Dataset
💻 Code

Large language models have made it surprisingly easy to build systems that can answer technical questions. However, adapting these models to specialized domains—such as computer science interview questions—remains an open challenge.

Two common strategies are widely used today:

Fine-tuning modifies the model’s internal parameters to better match the target domain. RAG, on the other hand, keeps the model unchanged but augments the input with retrieved knowledge.

This raises an interesting question:

Which approach works better for technical question answering?

To explore this, I conducted a small experiment using Mistral-7B-Instruct, comparing four configurations:

  1. Vanilla Mistral

  2. RAG + Vanilla

  3. LoRA Fine-Tuned

  4. RAG + Fine-Tuned

The results were not entirely what I expected.


Building a Technical QA Dataset

The first step was constructing a dataset of technical question-answer pairs covering core computer science topics such as:

  • Data structures

  • Algorithms

  • Operating systems

  • Databases

  • Computer networks

I began with a small seed dataset of roughly 200 curated interview questions. These were drawn from technical interview resources and existing open datasets.

To scale the dataset, I used Qwen to generate additional question-answer pairs. The model was prompted to produce variations of the seed questions while maintaining technical accuracy and domain relevance.

This synthetic expansion increased the dataset size to roughly 2,070 samples.

However, automatically generated data often contains redundancy, so several preprocessing steps were applied.


Dataset Cleaning and Filtering

Two filtering stages were used to improve dataset quality.

First, exact duplicates were removed by comparing normalized question strings. This removed 51 duplicate entries, leaving 2,019 samples.

Next, I performed semantic deduplication using sentence embeddings generated by MiniLM-L6-v2. For each question pair, cosine similarity was computed, and samples with similarity greater than 0.9 were considered paraphrases. In such cases, one of the duplicates was removed.
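The semantic pass can be sketched roughly as follows (the repository has the actual script; the "all-MiniLM-L6-v2" checkpoint name and the greedy keep-the-first strategy are assumptions on my part, only the 0.9 threshold comes from the description above):

from sentence_transformers import SentenceTransformer, util

# Semantic deduplication sketch: drop one question of any pair whose
# embedding cosine similarity exceeds 0.9.
model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_dedup(questions, threshold=0.9):
    emb = model.encode(questions, convert_to_tensor=True, normalize_embeddings=True)
    sim = util.cos_sim(emb, emb)          # pairwise cosine similarity matrix
    removed = set()
    for i in range(len(questions)):
        if i in removed:
            continue
        for j in range(i + 1, len(questions)):
            if sim[i, j] > threshold:
                removed.add(j)            # j is treated as a paraphrase of i
    return [q for i, q in enumerate(questions) if i not in removed]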

This process removed 213 additional samples, resulting in a final dataset of 1,806 unique question-answer pairs.

The dataset was then split into:

  • 70% training data (1264 samples)

  • 15% validation data (270 samples)

  • 15% test data (272 samples)

The split used a fixed random seed to ensure reproducibility.
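A minimal sketch of that split, assuming scikit-learn, an arbitrary seed of 42, and a placeholder file name (neither the actual seed nor the file name is stated in the post):

import json
from sklearn.model_selection import train_test_split

with open("qa_pairs.json") as f:          # placeholder path for the cleaned dataset
    samples = json.load(f)

train, rest = train_test_split(samples, test_size=0.30, random_state=42)  # 70% train
val, test = train_test_split(rest, test_size=0.50, random_state=42)       # 15% val / 15% test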

The final dataset is available on Kaggle.


Two Approaches to Domain Adaptation

With the dataset ready, I implemented two different strategies for adapting the model to technical question answering.

Approach 1: LoRA Fine-Tuning

The first approach used LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning technique.

Instead of updating all model weights, LoRA inserts small trainable matrices into the attention layers. This dramatically reduces the number of parameters that need to be trained while still allowing the model to adapt to new tasks.

The model was trained for three epochs on the training dataset using a learning rate of 1e-4. Because LoRA modifies only a small subset of parameters, the training process was relatively efficient even with a large base model.
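A sketch of this setup with Hugging Face peft (the rank, alpha, target modules, batch size, and exact base checkpoint are assumptions; only the epoch count and learning rate come from the experiment):

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-Instruct-v0.2"      # assumed instruct checkpoint
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

lora_config = LoraConfig(
    r=16,                                  # assumed rank
    lora_alpha=32,                         # assumed scaling factor
    target_modules=["q_proj", "v_proj"],   # small trainable matrices in the attention layers
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only a small fraction of weights is trainable

args = TrainingArguments(
    output_dir="mistral-qa-lora",
    num_train_epochs=3,                    # from the post
    learning_rate=1e-4,                    # from the post
    per_device_train_batch_size=4,         # assumed
)

These pieces would then be passed to a standard Trainer (or trl's SFTTrainer) together with the tokenized training split.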

The training loss curve showed steady convergence across the training steps.


Approach 2: Retrieval-Augmented Generation

The second approach was Retrieval-Augmented Generation (RAG).

Instead of modifying the model weights, RAG retrieves relevant context from a knowledge base and includes it in the prompt during generation.

The retrieval pipeline consisted of:

  • Sentence embeddings generated using MiniLM-L6-v2

  • A FAISS vector index for efficient similarity search

  • Retrieval of the top 5 most relevant context passages

During inference, each question was processed through the following pipeline:

Question
→ Embed the query
→ Retrieve similar QA examples
→ Inject retrieved context into the prompt
→ Generate the answer using the LLM

In theory, this allows the model to access domain-specific knowledge without requiring retraining.
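A condensed sketch of that pipeline (the index type, prompt template, and toy knowledge base below are illustrative; in the experiment the index was built over the training QA pairs):

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Toy knowledge base; in the experiment this is the training set of QA pairs.
knowledge_base = [
    ("What is a stack?", "A LIFO data structure supporting push and pop."),
    ("What is a deadlock?", "A state where processes wait on each other's locks forever."),
]
kb_texts = [f"Q: {q}\nA: {a}" for q, a in knowledge_base]
kb_vecs = embedder.encode(kb_texts, normalize_embeddings=True)
index = faiss.IndexFlatIP(kb_vecs.shape[1])      # inner product = cosine on normalized vectors
index.add(np.asarray(kb_vecs, dtype="float32"))

def build_prompt(question, k=5):
    q_vec = embedder.encode([question], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q_vec, dtype="float32"), k)
    context = "\n\n".join(kb_texts[i] for i in ids[0] if i != -1)
    return f"Use the following examples to answer the question.\n\n{context}\n\nQuestion: {question}\nAnswer:"

The returned prompt is then passed to the LLM for generation.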


Experimental Setup

To compare both strategies fairly, four model configurations were evaluated:

Model              Description
Vanilla            Base Mistral-7B model
RAG + Vanilla      Retrieval-augmented inference
Fine-Tuned         LoRA fine-tuned model
RAG + Fine-Tuned   Retrieval combined with the fine-tuned model

Evaluation was performed on the held-out test set using four metrics:

  • BLEU-4 — measures n-gram overlap

  • ROUGE-L — captures structural similarity

  • BERTScore — measures semantic similarity using contextual embeddings

  • Exact Match — checks whether the generated answer exactly matches the reference
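As a rough illustration, all four can be computed with the Hugging Face evaluate library (the post does not state which implementation was actually used, so treat this as one plausible setup):

import evaluate

bleu = evaluate.load("bleu")            # BLEU-4 by default (max_order=4)
rouge = evaluate.load("rouge")          # reports rougeL among others
bertscore = evaluate.load("bertscore")
exact_match = evaluate.load("exact_match")

preds = ["A balanced tree has subtree heights differing by at most one."]
refs  = ["A binary tree is balanced if subtree heights differ by at most one."]

print(bleu.compute(predictions=preds, references=refs)["bleu"])
print(rouge.compute(predictions=preds, references=refs)["rougeL"])
print(bertscore.compute(predictions=preds, references=refs, lang="en")["f1"])
print(exact_match.compute(predictions=preds, references=refs)["exact_match"])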


Results

The evaluation results are summarized below.

Model              BLEU-4   ROUGE-L   BERTScore
Vanilla            0.027    0.213     0.929
RAG + Vanilla      0.051    0.298     0.890
Fine-Tuned         0.056    0.287     0.889
RAG + Fine-Tuned   0.038    0.252     0.871

At first glance, fine-tuning appears competitive, since it achieves the highest BLEU-4 score. A closer look at the semantic metric (BERTScore) tells a different story: both fine-tuned configurations score below the vanilla model, so the lexical gains did not translate into better meaning.

Among the adapted configurations, RAG + Vanilla offered the best overall balance, with the highest ROUGE-L, a strong BLEU-4, and higher semantic alignment than either fine-tuned variant. This suggests that retrieval helped ground the model's answers in relevant context.


A Qualitative Example

Consider the question:

“Implement a function to check if a binary tree is balanced.”

The reference answer describes using a recursive function to compute subtree heights and check whether the height difference exceeds one.

The RAG + Fine-Tuned model produced an unrelated explanation about binary search trees and hash tables.

In contrast, RAG + Vanilla generated a correct recursive approach, describing how to compute subtree heights and verify balance conditions.

This pattern appeared multiple times in the evaluation results.


The Fine-Tuning Paradox

One of the most interesting findings was what I refer to as the fine-tuning paradox.

Fine-tuning improved certain lexical metrics, such as BLEU and ROUGE, but sometimes degraded semantic accuracy. In several cases, the fine-tuned model produced answers that were grammatically correct yet conceptually incorrect.

This behavior resembles catastrophic forgetting, where the model loses some of its general knowledge while adapting to a narrower dataset.

Because the fine-tuning dataset was relatively small, the model may have overfit to specific phrasing patterns rather than deeper conceptual understanding.

Why Retrieval Worked Better

Retrieval-augmented generation offers a different advantage: it does not modify the model’s internal knowledge.

Instead, it provides relevant context dynamically during inference.

This has several benefits:

  • The base model retains its general reasoning ability

  • Answers are grounded in retrieved domain knowledge

  • The system can easily incorporate new data without retraining

For technical domains where precise definitions matter, this approach appears particularly effective.


Final Thoughts

This experiment suggests that retrieval-based approaches may be more reliable than aggressive fine-tuning for technical question answering.

While fine-tuning can improve surface-level metrics, retrieval provides a more robust mechanism for grounding model responses in relevant knowledge.

In practice, the RAG + Vanilla configuration offered the best balance of accuracy and reliability.


Resources

GitHub Repository : https://github.com/AtulDeshpande09/rag-technical-qa
HuggingFace Model (fine-tuned) : https://huggingface.co/AtulDeshpande/mistral-interview-assistant
Kaggle Dataset : https://www.kaggle.com/datasets/atuldeshpande96/technical-question-answering-dataset

Tuesday, September 9, 2025

Do Google Search Results Differ by Location?

Do Google Search Results Change with Location?

Experiment date: 2025-09-03 • Countries tested: US, India, Japan, Brazil, Germany

Introduction

Question: If we type the same query into Google from different locations, do we get the same answers?

Short answer: No. Search Engine Result Pages (SERPs) vary by many factors — location is a major one. For example, "best food" searched in Mumbai will return different results than the same query from Tokyo.

In this post I describe a small experiment: I ran the same queries from five countries, compared results using Jaccard similarity, and visualized the differences with heatmaps.

Experiment setup

I tested 5 countries and 9 queries grouped into three categories:

  • Countries: United States, India, Japan, Brazil, Germany
  • Query categories (3 each):
    • Neutral: climate change facts, latest AI research, mathematicians
    • Cultural: best food, cultural values, nationalist
    • Commercial: best laptop, study abroad programs, buy electronics online

All searches were performed on Google using SerpAPI. Each query-location pair used the top 10 organic results. The experiment metadata (timestamps, location, search IDs) is available here: metadata.csv.
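A sketch of the collection step, assuming the google-search-results (SerpAPI) Python client; the API key is a placeholder and the interface-language setting is my assumption:

from serpapi import GoogleSearch

def top10_links(query, country_code):
    """Return the set of top-10 organic result URLs for a query in a given country."""
    search = GoogleSearch({
        "q": query,
        "gl": country_code,       # country for the search, e.g. "us", "in", "jp", "br", "de"
        "hl": "en",               # interface language (assumed; the post does not say)
        "num": 10,
        "api_key": "YOUR_SERPAPI_KEY",
    })
    results = search.get_dict()
    return {r["link"] for r in results.get("organic_results", [])[:10]}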

Method: For each pair of locations we computed the Jaccard similarity of their top-10 result sets:
Jaccard(A, B) = |A ∩ B| / |A ∪ B|
The result is a 5×5 symmetric matrix (locations × locations) for each query.
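The pairwise comparison itself is only a few lines over the URL sets collected above (heatmap plotting omitted):

from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two result sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

# results: dict mapping country -> set of top-10 URLs for one query
def similarity_matrix(results):
    countries = sorted(results)
    matrix = {(c, c): 1.0 for c in countries}
    for c1, c2 in combinations(countries, 2):
        s = jaccard(results[c1], results[c2])
        matrix[(c1, c2)] = matrix[(c2, c1)] = s
    return matrix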

Results (high level)

Overall, the data shows location affects search results. The effect is strongest for Commercial and Cultural queries and weaker for Neutral queries.

Cultural queries — example: cultural values

Below is the heatmap of Jaccard similarity for the query cultural values. Higher values mean more overlap in the top-10 results between two countries.

Figure 1: Heatmap of Jaccard similarity for cultural values (higher = more overlap). Japan shows noticeably different results compared to other countries.

Commercial queries — example: best laptop

Commercial queries like "best laptop" show much less overlap between countries — many results differ because country-specific retailers, local review sites, and regionally-preferred brands appear.

Figure 2: Heatmap of Jaccard similarity for best laptop. There is less overlap across countries than for cultural or neutral queries.

Notes: Results above are computed from only the top 10 results per location. Expanding to top-50 would give a fuller picture, but is slower and requires more careful de-duplication.

Conclusion

This simple experiment supports the common belief: Google SERPs vary significantly by location, especially for queries influenced by culture and commerce.

Next question: can location-based results introduce bias? I plan a follow-up experiment to explore that — stay tuned.

If you'd like the raw data or the scripts I used (including the code that produced the heatmaps), you can find them in the repository: GitHub.

View code & data on GitHub

Questions or ideas? Reply below or open an issue on the GitHub repo. Feel free to try your own query — the overlap changes a lot depending on what you search for.

Wednesday, July 2, 2025

Tense-Dependent Subject Inflection in Marathi: A Hidden Challenge in Natural Language Generation

 


Tags: Marathi NLP, Morphological Analysis, Natural Language Generation, Subject Inflection, Low-Resource Languages, Rule-Based NLP, Indo-Aryan Languages


 Introduction

When we think about building natural language generation (NLG) systems for Indian languages, we often focus on verb conjugation — especially for tense handling.

But for Marathi, a morphologically rich Indo-Aryan language, this isn't enough.

Why?
Because changing tense doesn't only affect the verb — it also changes the subject.

This blog explores a subtle but essential linguistic rule in Marathi that impacts sentence generation — and how ignoring it can lead to grammatically incorrect translations.


 A Real Example from Marathi

Let’s say we want to generate the Marathi sentence for:

"She eats"ती खाते
(Here, the subject "ती" is in its nominative form.)

But now consider:

"She ate"तीने खाल्ले
The subject changes from "ती" to "तीने" — it's now ergative.

🧩 Most machine translation and NLG systems focus only on changing the verb (खाते → खाल्ले) — but completely miss the subject change (ती → तीने).


 Why This Happens: Ergative Alignment in Marathi

Marathi uses a split-ergative grammar, meaning:

  • In present tense, the subject is nominative.

  • In past tense, the subject takes an ergative case marker (“ने”).

This is not an exception or irregularity.
It’s a core rule of the language, grounded in syntactic alignment.

🧠 Ergative alignment is also found in other Indo-Aryan languages like Hindi, Konkani, and Nepali.


 Problem in NLP Systems

Most NLP generation pipelines — whether rule-based or neural — do not account for subject case marking that depends on tense. Here's what often goes wrong:

  • Incorrect Output: ती खाल्ले सफरचंद
    (Subject not in ergative case)

  • Correct Output: तीने सफरचंद खाल्ले

This kind of mismatch affects:

  • Machine translation

  • Dialogue generation

  • Morphology-aware generation

  • Educational tools for language learning


 Why This Is Important for Developers and Researchers

If you're working on:

  • Multilingual NLP

  • Low-resource language modeling

  • Morphological analyzers

  • Grammar-based generation systems

...then subject inflection in tense-sensitive contexts is something you can't ignore.

By capturing such language-specific rules, we can improve:

  • Fluency

  • Grammatical accuracy

  • Cultural authenticity of generated text


 What I’m Working On

I’m currently implementing these improvements in my hybrid English-to-Marathi generator:

✅ Rule-based handling of subject inflection
✅ Integration of tense detection to trigger case marking
✅ Plan to extend to Hindi and Nepali for broader Indo-Aryan modeling
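As a toy illustration of the first point, the core rule can be expressed in a few lines (a deliberately simplified sketch; real Marathi generation also needs gender, number, and verb-agreement handling that is omitted here):

# Toy rule: in the past tense of a transitive verb, attach the ergative
# marker "ने" to the subject; otherwise keep the nominative form.
def inflect_subject(subject, tense, transitive=True):
    if tense == "past" and transitive:
        return subject + "ने"
    return subject

print(inflect_subject("ती", "present"))  # ती   (nominative)
print(inflect_subject("ती", "past"))     # तीने (ergative)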


 Takeaway

In Marathi, tense changes both the verb and the subject.
Ignoring this can lead to flawed, unnatural sentence generation.

Understanding this linguistic structure is not just about accuracy — it’s about respecting the depth of human language in machine models.


Friday, May 23, 2025

Learning Biological Semantics with BioSkipGram: A Deep Dive into Sequence Embeddings

Research Theme: Computational Biology, Sequence Modeling, Representation Learning
Focus: From-scratch Word2Vec (Skip-Gram) on Genomic Data


Rethinking Language Models for Genomics

Much like human languages, biological sequences carry rich, structured information. In natural language processing (NLP), word embeddings such as Word2Vec have revolutionized our ability to model context and meaning from unstructured text.

With BioSkipGram, I extended this concept to genomic data, building a skip-gram model from scratch in PyTorch, trained on coding DNA sequences (CDS) from Homo sapiens. Instead of words, the model learns embeddings for k-mers, the biological analogues of linguistic tokens.
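A compact sketch of the core pieces, assuming k = 3, a tiny toy corpus, and a plain softmax objective (the actual project uses negative sampling and a much larger pipeline):

import torch
import torch.nn as nn

def kmers(seq, k=3):
    """Tokenize a DNA sequence into overlapping k-mers."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def skipgram_pairs(tokens, window=2):
    """(center, context) token pairs within a sliding window."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

seqs = ["ATGGCCATTGTAATGGGCC", "ATGAAACGCATTAGCACCA"]   # toy CDS fragments
tokens = [t for s in seqs for t in kmers(s)]
vocab = {t: i for i, t in enumerate(sorted(set(tokens)))}

class SkipGram(nn.Module):
    def __init__(self, vocab_size, dim=16):
        super().__init__()
        self.in_emb = nn.Embedding(vocab_size, dim)    # center k-mer embeddings
        self.out = nn.Linear(dim, vocab_size)          # predicts context k-mers

    def forward(self, center_ids):
        return self.out(self.in_emb(center_ids))

model = SkipGram(len(vocab))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

pairs = [(vocab[c], vocab[x]) for s in seqs for c, x in skipgram_pairs(kmers(s))]
centers = torch.tensor([c for c, _ in pairs])
contexts = torch.tensor([x for _, x in pairs])
for _ in range(50):                                    # tiny training loop
    opt.zero_grad()
    loss = loss_fn(model(centers), contexts)
    loss.backward()
    opt.step()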


Why This Matters

This project isn't just about modeling—it’s about discovering latent structure in DNA using ideas from language. It reflects a fundamental shift in how we approach biology:

Genes aren't just biochemical—they're computational.

With that lens, this project taught me to:

  • Map natural language concepts like context, semantics, and distributional similarity to DNA sequences.

  • Design a full learning pipeline, from data preprocessing to training and embedding visualization.

  • Understand how unsupervised representation learning uncovers subtle relationships, even in highly repetitive biological data.


Research Takeaways

Through this project, I developed more than just a working model. I gained meaningful insight into key ideas at the intersection of biology and machine learning:

1. Biological Semantics Are Contextual

Even in DNA, context matters. By training on sliding k-mer windows, the model captures recurring motifs and biological "phrases"—a principle shared with natural language semantics.

2. From NLP to Genomics: The Role of Transfer Learning

This project reaffirms that ideas born in NLP (like skip-gram and embedding vectors) are adaptable and powerful for biological sequence analysis. This strengthens the case for cross-domain learning paradigms.

3. Building from Scratch Builds Understanding

Implementing everything from the tokenizer to the training loop gave me an engineering-level appreciation of:

  • Negative sampling and sparse gradients

  • Vocabulary construction for biological tokens

  • Embedding space regularization and interpretability

This hands-on process significantly sharpened my ability to connect theoretical learning to real-world biological questions.


Sample Output

Below is a t-SNE projection of the learned k-mer embeddings. It reveals clustering behavior that hints at shared biological function or sequence origin:



Broader Impact

Projects like this are foundational for future research in:

  • Mutation detection (e.g., variant embeddings)

  • Protein-DNA interaction prediction

  • Functional annotation of non-coding regions

  • Custom embeddings for clinical genomics pipelines

As I continue my journey into bioinformatics, I see this project as a stepping stone toward more ambitious research—where computational abstraction meets biological function.


🔗 GitHub & Dataset

👉 GitHub Repository
📥 Homo sapiens CDS Dataset (NCBI)



Saturday, May 17, 2025

Simulating a Genetic Ring Oscillator: A Synthetic Biology Approach with Python

Field: Synthetic Biology, Computational Modeling
Focus: Genetic Circuits, PoPS Modeling, Numerical Integration


 Overview

In this project, I simulate the behavior of a genetic ring oscillator, a synthetic circuit built from gene regulatory inverters. This work is inspired by MIT's 20.180 (Biological Engineering Programming), which introduces the fundamentals of gene circuit modeling and protein regulation through computational tools.

The simulation is implemented entirely in Python using numerical methods, offering an educational and research-grade foundation for exploring time-based genetic dynamics.


 Biological Background

Synthetic biology treats DNA-based logic circuits much like digital electronics. At the core of this simulation is the inverter, a genetic NOT gate. These gates are modeled using:

  1. Protein Generator:

$\frac{dR}{dt} = \text{ProductionRate} \times \text{PoPS}_{\text{in}} - k_d \cdot R$

Where:

  • $R$: Repressor protein concentration

  • $k_d = \frac{\ln 2}{t_{1/2}}$: First-order decay rate

  • $\text{PoPS}_{\text{in}}$: Polymerase per second signal (transcription rate)

  2. PoPS Regulator:

$\text{PoPS}_{\text{out}} = \text{PoPS}_{\text{max}} \cdot \frac{k_D}{k_D + R}$

This models transcriptional repression where proteins bind DNA to reduce the outgoing signal.


Ring Oscillator Circuit

By chaining three inverters in a closed loop, we create a genetic ring oscillator. An odd number of NOT gates causes signal inversion to propagate over time, leading to oscillatory protein expression—a biological clock.

This behavior is central to many real-world synthetic constructs, including toggle switches and bistable memory units.


Implementation Highlights

  • Built in Python, simulating each inverter with its internal protein generator and repression logic.

  • Used Euler’s method for solving ODEs governing protein concentrations and signal flow.

  • Configurable parameters: half-life, production rate, repression constant $k_D$, PoPS max, and time steps.

  • Simulates both steady state and transient oscillatory dynamics.
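A minimal sketch of the Euler integration loop for three chained inverters (parameter values here are placeholders, not the ones behind the example output below):

import numpy as np

# Placeholder parameters (illustrative only).
production_rate = 10.0     # proteins per PoPS per second
half_life = 600.0          # seconds
k_d = np.log(2) / half_life
K_D = 500.0                # repression constant
pops_max = 70.0
dt, steps = 1.0, 20000

R = np.array([800.0, 100.0, 300.0])     # repressor levels, one per inverter
history = []
for _ in range(steps):
    # Each inverter's input PoPS is the previous inverter's output (closed ring).
    pops_out = pops_max * K_D / (K_D + R)
    pops_in = np.roll(pops_out, 1)
    # Euler update of dR/dt = production_rate * PoPS_in - k_d * R
    R = R + dt * (production_rate * pops_in - k_d * R)
    history.append(R.copy())            # trajectory for plotting the oscillation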

     


Example Output

    Input Signal : 70.0 PoPS
    Repressor Protein (Inverter 1): 1014.49

    PoPS Out (Inverter 1): 0.0689
    PoPS Out (Inverter 2): 35.0172
    PoPS Out (Inverter 3): 0.1376
    ...

Conclusion

This project reflects a broader vision of programmable biology. By modeling genetic circuits computationally, we gain early insight into their behavior, empowering researchers and students to design biological systems with the same logic we use for digital machines.

This also opens pathways toward more advanced sequence-based modeling, which I plan to pursue as part of my future research in bioinformatics and computational genomics.

Wednesday, April 16, 2025

Understanding Topic Modeling with Latent Dirichlet Allocation (LDA)


🧠 Introduction

Topic modeling is a widely used technique in Natural Language Processing (NLP) that uncovers hidden thematic structures in a collection of documents. Among several approaches, Latent Dirichlet Allocation (LDA) remains one of the most prominent and interpretable probabilistic models.

In this post, we demonstrate a basic yet meaningful implementation of LDA, operating over a custom-defined corpus. While the dataset used is intentionally minimal for illustration, the underlying concepts scale well to real-world applications.


🗂️ Objective

To model topics within a small collection of text samples using LDA and observe how the model probabilistically distributes terms across discovered topics.


📚 Input Corpus

We begin with a simplified, tokenized corpus, where each document is represented as a list of words:

corpus = [
    ["apple", "banana", "apple", "fruit", "fruit", "banana"],
    ["dog", "cat", "dog", "animal", "pet", "cat"],
    ["banana", "fruit", "apple", "orange", "fruit", "banana"]
]

Here, two dominant domains are implied:

  • A fruit-related group of words

  • An animal/pet-related set

The intention is to assess if LDA can discover these latent groupings with minimal supervision.
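The implementation in this post is written from scratch, so the output format below is its own; as a quick cross-check, the same toy corpus run through a library such as gensim (an assumption, not the code used here) would look roughly like this:

from gensim import corpora, models

corpus = [
    ["apple", "banana", "apple", "fruit", "fruit", "banana"],
    ["dog", "cat", "dog", "animal", "pet", "cat"],
    ["banana", "fruit", "apple", "orange", "fruit", "banana"],
]

dictionary = corpora.Dictionary(corpus)              # word <-> id mapping
bow = [dictionary.doc2bow(doc) for doc in corpus]    # bag-of-words counts per document

lda = models.LdaModel(bow, num_topics=2, id2word=dictionary,
                      passes=20, random_state=42)
for topic_id, terms in lda.print_topics(num_words=4):
    print(topic_id, terms)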


🧪 Output (Sample Topic Distributions)

The LDA model assigns words to topics using probabilities. After training the model, the following topics emerged:

Topic 0

['0.999881*banana', '0.999881*fruit', '0.999822*apple', '0.965500*orange']

This topic predominantly encapsulates the fruit domain with a high concentration on terms like banana, fruit, apple, and orange.

Topic 1

['0.999918*dog', '0.999918*cat', '0.991821*animal', '0.991821*pet']

Topic 1 clearly captures the animal/pet domain, clustering relevant terms tightly together.

Note: Very low probability spillovers (on the order of 0.0001) are expected due to smoothing and the probabilistic nature of the model.


📌 Reflections

  • This experiment reaffirms LDA’s ability to distinguish distinct semantic domains, even with minimal data.

  • LDA models each document as a mixture of topics and each topic as a distribution over words, which makes it highly interpretable.

  • While this implementation focuses on a toy dataset, scaling this up with real-world corpora and preprocessing pipelines would allow for meaningful topic analysis in applications like content recommendation, academic article clustering, or trend detection in social media.



✅ Conclusion

Implementing topic modeling from scratch reinforces a solid understanding of probabilistic text analysis and paves the way for more sophisticated NLP pipelines. This milestone marks an important step in developing practical NLP skills with strong theoretical grounding.


Tuesday, March 25, 2025

TF-IDF: Enhancing Text Representation Beyond Bag of Words


Introduction

 

While the Bag of Words (BoW) model provides a simple way to represent text, it treats all words equally, failing to capture their importance in a document. This is where TF-IDF (Term Frequency-Inverse Document Frequency) comes into play. TF-IDF not only represents words numerically but also assigns weights based on their relevance within a corpus.

This article explores the concept, implementation, and significance of TF-IDF, building upon our previous work with BoW.


Understanding TF-IDF

TF-IDF is a weighting scheme that measures the importance of a word in a document relative to the entire collection of documents (corpus). It consists of two components:

  1. Term Frequency (TF): Measures how often a word appears in a document.

    $TF(w) = \dfrac{\text{Number of times } w \text{ appears in the document}}{\text{Total number of words in the document}}$
  2. Inverse Document Frequency (IDF): Measures how rare a word is across all documents.

    $IDF(w) = \log \dfrac{\text{Total number of documents}}{\text{Number of documents containing } w}$

The final TF-IDF score is computed as:

$\text{TF-IDF}(w) = TF(w) \times IDF(w)$
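To make the weighting concrete, here is a minimal from-scratch sketch in Python (illustrative only; it follows the un-smoothed formulas above, so words that appear in every document receive a weight of zero):

import math
from collections import Counter

def tf_idf(corpus):
    """Compute TF-IDF weights for a list of tokenized documents."""
    n_docs = len(corpus)
    df = Counter()                       # document frequency of each word
    for doc in corpus:
        df.update(set(doc))

    weights = []
    for doc in corpus:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            w: (c / total) * math.log(n_docs / df[w])
            for w, c in counts.items()
        })
    return weights

docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["dogs", "and", "cats", "are", "pets"],
]
for i, w in enumerate(tf_idf(docs)):
    print(i, sorted(w.items(), key=lambda kv: -kv[1])[:3])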


Why TF-IDF Works Better Than BoW

  1. Reduces the impact of frequent words like “the” and “is” by assigning lower weights.

  2. Boosts important words that appear less frequently but are significant in meaning.

  3. Enables better document comparison, improving text search and ranking.


Challenges & Limitations

Although TF-IDF improves text representation, it has its own drawbacks:

  • Ignores word meaning and order, failing to capture relationships like synonyms or context.

  • Sparse representation, making it inefficient for large corpora.

  • Sensitive to rare words, which may sometimes receive excessive weight.

To address these, advanced word embeddings such as Word2Vec, GloVe, and Transformer-based models (BERT, GPT) are used for semantic understanding.



Conclusion

The TF-IDF model enhances text representation by weighing words based on their importance in a document. Implementing it from scratch reinforces a deep understanding of feature extraction in NLP. With this foundation, we are now prepared to delve into vector-based representations for richer textual meaning.


