Tuesday, March 25, 2025

TF-IDF: Enhancing Text Representation Beyond Bag of Words


Introduction

 

While the Bag of Words (BoW) model provides a simple way to represent text, it treats all words equally, failing to capture their importance in a document. This is where TF-IDF (Term Frequency-Inverse Document Frequency) comes into play. TF-IDF not only represents words numerically but also assigns weights based on their relevance within a corpus.

This article explores the concept, implementation, and significance of TF-IDF, building upon our previous work with BoW.


Understanding TF-IDF

TF-IDF is a weighting scheme that measures the importance of a word in a document relative to the entire collection of documents (corpus). It consists of two components:

  1. Term Frequency (TF): Measures how often a word appears in a document.

    TF(w) = \frac{\text{Number of times } w \text{ appears in a document}}{\text{Total number of words in the document}}
  2. Inverse Document Frequency (IDF): Measures how rare a word is across all documents.

    IDF(w) = \log \frac{\text{Total number of documents}}{\text{Number of documents containing } w}

The final TF-IDF score is computed as:

TF\text{-}IDF(w) = TF(w) \times IDF(w)
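
As a minimal sketch of how these formulas translate into code (the corpus, tokenizer, and function names here are illustrative assumptions, not the exact implementation discussed), the following plain-Python script computes TF-IDF scores for a small corpus:

    import math
    from collections import Counter

    def tokenize(text):
        # Lowercase and split on whitespace; punctuation handling is omitted for brevity.
        return text.lower().split()

    def compute_tf(tokens):
        # Term frequency: count of each word divided by the document length.
        counts = Counter(tokens)
        total = len(tokens)
        return {word: count / total for word, count in counts.items()}

    def compute_idf(corpus_tokens):
        # Inverse document frequency: log(N / number of documents containing the word).
        n_docs = len(corpus_tokens)
        vocabulary = {word for tokens in corpus_tokens for word in tokens}
        return {word: math.log(n_docs / sum(1 for tokens in corpus_tokens if word in tokens))
                for word in vocabulary}

    def tf_idf(corpus):
        corpus_tokens = [tokenize(doc) for doc in corpus]
        idf = compute_idf(corpus_tokens)
        # Scale each document's TF scores by the corpus-wide IDF weights.
        return [{word: tf * idf[word] for word, tf in compute_tf(tokens).items()}
                for tokens in corpus_tokens]

    corpus = ["I love NLP", "NLP is amazing", "This is why nlp is my research interest"]
    for i, scores in enumerate(tf_idf(corpus)):
        print(f"doc{i}:", {word: round(score, 3) for word, score in scores.items()})

Note that "nlp" appears in every document here, so its IDF, and therefore its TF-IDF weight, is zero; this is exactly the downweighting of ubiquitous terms that makes the scheme useful.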


Why TF-IDF Works Better Than BoW

  1. Reduces the impact of frequent words like “the” and “is” by assigning lower weights.

  2. Boosts important words that appear less frequently but are significant in meaning.

  3. Enables better document comparison, improving text search and ranking.
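
To see these effects on a concrete corpus, the short sketch below uses scikit-learn's TfidfVectorizer (assuming scikit-learn is installed; the toy sentences are illustrative). Ubiquitous words such as "the" receive a lower IDF than words confined to a single document, and the resulting vectors can be compared directly with cosine similarity:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    corpus = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs are popular pets",
    ]

    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(corpus)

    # Inspect the learned IDF weights: "the" (present in two documents) scores
    # lower than words that occur in only one document.
    for word, index in sorted(vectorizer.vocabulary_.items()):
        print(f"{word:>8}  idf={vectorizer.idf_[index]:.3f}")

    # Pairwise document similarity over the TF-IDF vectors.
    print(cosine_similarity(tfidf_matrix))

Note that scikit-learn uses a smoothed IDF variant (log((1 + N) / (1 + df)) + 1) rather than the textbook formula above, but the relative ordering of the weights behaves the same way.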


Challenges & Limitations

Although TF-IDF improves text representation, it has its own drawbacks:

  • Ignores word meaning and order, failing to capture relationships like synonyms or context.

  • Sparse representation, making it inefficient for large corpora.

  • Sensitive to rare words, which may sometimes receive excessive weight.

To address these, more advanced representations are used for semantic understanding: word embeddings such as Word2Vec and GloVe, and Transformer-based models such as BERT and GPT.



Conclusion

The TF-IDF model enhances text representation by weighing words based on their importance in a document. Implementing it from scratch reinforces a deep understanding of feature extraction in NLP. With this foundation, we are now prepared to delve into vector-based representations for richer textual meaning.



Friday, March 21, 2025

Building a Bag of Words Model from Scratch: A Step Toward Lexical Representations


Introduction

Text data is inherently unstructured, and in Natural Language Processing (NLP), transforming raw text into a structured format is crucial. One of the fundamental techniques for achieving this transformation is the Bag of Words (BoW) model. Despite its simplicity, BoW forms the foundation for more advanced text representation methods such as TF-IDF and word embeddings.

This article documents an implementation of BoW from scratch, explaining its core principles, challenges, and future enhancements.


Understanding the Bag of Words Model

The Bag of Words model represents text as a collection of words, disregarding grammar and word order but preserving word frequency. Each document is transformed into a vector where each dimension corresponds to a unique word in the vocabulary, and the value represents the word’s occurrence in the document.

For example, consider the following three sentences:

  1.  t1 = "I love NLP" 
  2.  t2 = "NLP is amazing"
  3.  t3 = "This is why nlp is my research interest"

The vocabulary set (in order of first appearance, after lowercasing) is:

    {'i', 'love', 'nlp', 'is', 'amazing', 'this', 'why', 'my', 'research', 'interest'}

Each sentence is then represented as a vector over this vocabulary:

    'doc0': [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    'doc1': [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]
    'doc2': [0, 0, 1, 2, 0, 1, 1, 1, 1, 1]

The frequency of each word forms the basis of the BoW representation; note that 'is' appears twice in t3, so doc2 has a count of 2 in that dimension.
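
As a minimal sketch of this construction (the tokenizer and helper names below are illustrative assumptions, not a prescribed implementation), the following Python script reproduces the vectors above:

    from collections import Counter

    def tokenize(text):
        # Lowercase and split on whitespace; punctuation handling is omitted for brevity.
        return text.lower().split()

    def build_vocabulary(corpus):
        # Collect unique words in order of first appearance.
        vocabulary = []
        for doc in corpus:
            for word in tokenize(doc):
                if word not in vocabulary:
                    vocabulary.append(word)
        return vocabulary

    def bag_of_words(corpus):
        vocabulary = build_vocabulary(corpus)
        vectors = []
        for doc in corpus:
            counts = Counter(tokenize(doc))
            # One dimension per vocabulary word, holding its count in this document.
            vectors.append([counts[word] for word in vocabulary])
        return vocabulary, vectors

    corpus = ["I love NLP", "NLP is amazing", "This is why nlp is my research interest"]
    vocabulary, vectors = bag_of_words(corpus)
    print(vocabulary)
    for i, vector in enumerate(vectors):
        print(f"doc{i}: {vector}")

Running the script prints the vocabulary followed by the three count vectors shown above.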

       


Challenges & Limitations

While BoW is simple and effective, it has some limitations:

  1. Loss of Context: Word order is ignored, leading to different meanings being treated the same.
  2. Sparsity: Large vocabularies result in high-dimensional vectors with many zero values.
  3. Equal Weighting: All words contribute equally, even if some are more important than others.

To overcome these, more advanced techniques such as TF-IDF and word embeddings (Word2Vec, GloVe, FastText) can be employed.


Next Steps: Moving Beyond BoW

Now that we’ve built BoW, the next step is to improve its effectiveness:

  • TF-IDF: A weighting scheme that reduces the impact of common words.
  • Word Embeddings: Capturing semantic meaning by learning word relationships.
  • Contextual Representations: Using deep learning models like BERT for advanced text understanding.

This marks the completion of the first step in our Lexical Representation journey. The next challenge awaits!


Conclusion

The Bag of Words model remains a cornerstone in NLP despite its limitations. Implementing it from scratch reinforces a deeper understanding of text vectorization techniques. With this foundation in place, we now venture into semantic-aware representations, where meaning and context play a critical role in text understanding.

