Tuesday, February 18, 2025

 

Exploring N-Gram Based Text Generation: A Probabilistic Approach

Introduction

Text generation is a fundamental task in Natural Language Processing (NLP) that has various applications, from autocomplete systems to chatbot responses. One of the simplest yet effective approaches to text generation is based on n-grams, a probabilistic model that predicts the next word given a sequence of previous words.

In this study, we implemented bigram-based text generation using probability distributions derived from a given corpus. While this method does not capture deep semantics, it provides a structured way to generate syntactically coherent text.

Understanding N-Gram Models

An n-gram is a contiguous sequence of n words from a given text corpus. Under the Markov Assumption, the probability of a word is approximated as depending only on the previous (n-1) words:

P(w_n | w_{n-1}, w_{n-2}, ..., w_1) \approx P(w_n | w_{n-1})

This simplification allows us to estimate probabilities efficiently.

We implemented a bigram model (n=2), which computes the probability of a word occurring given the previous word. The model was trained on a limited vocabulary dataset, which affects its ability to generalize but serves as a foundation for structured text generation.

Implementation Overview

  1. Tokenization: The input text was split into individual words.
  2. Bigram Construction: Pairs of consecutive words were extracted.
  3. Probability Estimation:
    • Count occurrences of each context word w_{n-1}.
    • Count occurrences of each bigram (w_{n-1}, w_n).
    • Compute conditional probabilities using: P(w_n | w_{n-1}) = \frac{C(w_{n-1}, w_n)}{C(w_{n-1})}
  4. Sentence Generation:
    • Start with a seed word.
    • Predict the next word based on probabilities.
    • Repeat until a stopping condition is met.
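The four steps above fit in a short Python sketch. This is a minimal illustration rather than our full implementation; the toy corpus and the fixed word limit are assumptions made for the example:

import random
from collections import defaultdict, Counter

# Toy corpus standing in for the real training text.
corpus = "the cat sees the mouse the mouse sees the dog"

# 1. Tokenization
tokens = corpus.lower().split()

# 2. Bigram construction
bigrams = list(zip(tokens, tokens[1:]))

# 3. Probability estimation: P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})
counts = defaultdict(Counter)
for prev, nxt in bigrams:
    counts[prev][nxt] += 1
probs = {
    prev: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
    for prev, followers in counts.items()
}

# 4. Sentence generation: start from a seed word and sample successors.
def generate(seed, max_words=10):
    words = [seed]
    while len(words) < max_words and words[-1] in probs:
        followers = probs[words[-1]]
        words.append(random.choices(list(followers), weights=followers.values())[0])
    return " ".join(words)

print(generate("the"))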

Observations and Limitations

The generated sentences maintained a basic grammatical structure but lacked deeper coherence due to the absence of long-range dependencies and semantic understanding. Common limitations include:

  • Lack of Vocabulary Generalization: Unseen words result in failure to generate meaningful sequences.
  • Short-Term Memory: The model only considers one previous word.
  • No Semantic Awareness: Words are selected based on frequency rather than meaning.

Next Steps: Addressing Semantic Awareness

While n-gram models provide a structured approach to text generation, they are inherently limited in capturing meaning. Moving forward, we aim to incorporate semantic understanding using:

  • Word Embeddings (Word2Vec, GloVe, FastText) to capture word relationships.
  • Neural Language Models (LSTMs, Transformers) for contextual awareness.
  • Smoothing Techniques (e.g., Laplace Smoothing) to handle unseen words.
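As a concrete example, Laplace (add-one) smoothing would replace the bigram estimate above with

P(w_n | w_{n-1}) = \frac{C(w_{n-1}, w_n) + 1}{C(w_{n-1}) + V}

where V is the vocabulary size, so every unseen bigram receives a small non-zero probability instead of halting generation.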

By integrating these approaches, we aim to enhance the quality of generated text and move towards more intelligent and coherent language models.

Conclusion

This work marks an important step in probabilistic text generation. The n-gram approach provides a simple yet effective method for structured text generation, forming the foundation for more advanced techniques. As we progress, we will refine our models to capture not only syntactic correctness but also deeper semantic understanding, making text generation more meaningful and contextually aware.

Wednesday, February 12, 2025

 

Probabilistic Context-Free Grammar (PCFG) for Sentence Generation

Introduction

Natural Language Generation (NLG) is a crucial aspect of computational linguistics, contributing to applications ranging from machine translation to text summarization. One foundational approach to structured text generation is through Probabilistic Context-Free Grammars (PCFGs). This article explores our implementation of PCFG-based sentence generation, a technique that extends traditional Context-Free Grammars (CFGs) by incorporating probabilistic rule selection.

Motivation

CFGs are widely used in syntactic parsing, but they offer no principled way to choose among alternative productions, so generation with them lacks natural variability. By assigning probabilities to different production rules, PCFGs allow us to model natural variations in sentence structure, making them better suited to real-world NLP applications. Our implementation aims to generate grammatically correct and statistically probable sentences using a defined set of rules.

 

Methodology

Our PCFG is structured as a dictionary where:

  • Non-terminal symbols (e.g., S, NP, VP) map to possible expansions with assigned probabilities.
  • Terminal symbols (e.g., words like "cat" or "dog") serve as the final output of the generation process.
  • A recursive function expands a given symbol by selecting a production rule based on its probability distribution.

Grammar Structure

The implemented PCFG follows a simple sentence structure:

  • S → NP VP (85%) | S conj S (15%)
  • NP → Det N (30%) | Name (30%) | Det JJ N (40%)
  • VP → V NP (95%) | V (5%)
  • Det → {the, a}
  • JJ → {big, little, white, black}
  • N → {cat, dog, mouse}
  • V → {sees, chases}
  • Name → {Alice, Bob}
  • conj → {and, but}
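To make this concrete, here is a minimal sketch of a generator for the grammar above. The rule probabilities are as listed; sampling terminal words uniformly is an assumption of the example:

import random

# Non-terminals map to (expansion, probability) pairs; terminals map to word lists.
grammar = {
    "S":  [(["NP", "VP"], 0.85), (["S", "conj", "S"], 0.15)],
    "NP": [(["Det", "N"], 0.30), (["Name"], 0.30), (["Det", "JJ", "N"], 0.40)],
    "VP": [(["V", "NP"], 0.95), (["V"], 0.05)],
}
terminals = {
    "Det": ["the", "a"],
    "JJ": ["big", "little", "white", "black"],
    "N": ["cat", "dog", "mouse"],
    "V": ["sees", "chases"],
    "Name": ["Alice", "Bob"],
    "conj": ["and", "but"],
}

def expand(symbol):
    # Terminals: pick a word (uniformly, by assumption).
    if symbol in terminals:
        return random.choice(terminals[symbol])
    # Non-terminals: pick a production according to its probability, then recurse.
    rules, weights = zip(*grammar[symbol])
    rule = random.choices(rules, weights=weights)[0]
    return " ".join(expand(s) for s in rule)

sentence = expand("S")
print(sentence[0].upper() + sentence[1:] + ".")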

Implementation Highlights

A recursive function expands each symbol in turn, choosing the next production probabilistically with random.choices(). This ensures natural variation in the generated sentences, unlike a deterministic CFG expansion. Sample outputs include:

A mouse sees a white cat.
Bob chases a mouse.
A cat sees a big cat.
Bob sees a big dog and a white cat sees Bob and a big dog chases a cat.
A white mouse chases a dog.
Bob sees a white dog but Alice chases Alice.

These results showcase variability in sentence structure, demonstrating the effectiveness of probabilistic selection.

 

Conclusion & Future Work

Our PCFG-based sentence generator successfully creates structured yet diverse sentences. Moving forward, potential extensions include:

  • Incorporating higher-level linguistic constraints for grammatical correctness.
  • Training PCFGs on real-world corpora to derive probabilistic rules from actual text data.
  • Integrating machine learning models (e.g., Markov models or neural networks) to enhance fluency.

By refining probabilistic grammar techniques, we take a step closer to more realistic and coherent NLG models. This work serves as a foundation for future exploration into probabilistic language generation methods.

Monday, February 10, 2025

 

Advancing NLP with CFG: Completing the Parsing Phase and Preparing for Sentence Generation

Introduction

In our previous blog, we introduced a Context-Free Grammar (CFG) based parser, designed to validate English sentence structures through Recursive Descent Parsing (RDP). At that stage, our implementation supported basic sentence validation for present-tense structures. Since then, significant progress has been made—our parser now fully supports all 12 tenses, making it a robust rule-based syntactic analysis tool. This blog post will document the completed parsing phase and lay the groundwork for the next step: CFG-based sentence generation.

Parsing Phase: Achievements and Refinements

1. Expansion to All Tenses

Initially, our parser handled only simple present tense. To enhance its linguistic capabilities, we systematically expanded its grammar rules to include:

  • Past Tense (Simple, Continuous, Perfect, Perfect Continuous)
  • Future Tense (Simple, Continuous, Perfect, Perfect Continuous)
  • Present Tense (remaining forms: Continuous, Perfect, Perfect Continuous)

By implementing structured CFG rules for verb variations and auxiliary constructions, we ensured that our parser can now handle diverse grammatical patterns.
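To give a flavor of these rules, the productions below are illustrative simplifications, not our exact rule set:

            VP -> V NP                      (simple present: "chases the mouse")
            VP -> Aux_be V_ing NP           (continuous: "is chasing the mouse")
            VP -> Aux_have V_pp NP          (perfect: "has chased the mouse")
            VP -> Aux_will V NP             (simple future: "will chase the mouse")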

2. Error Detection and Sentence Validation

The recursive descent approach allows for strict syntactic validation, making the parser capable of:

  • Detecting structural errors in sentences.
  • Identifying incorrect tense formations.
  • Providing feedback on incomplete or grammatically incorrect inputs.

This feature can serve as an additional layer of validation for NLP models that rely on statistical parsers, which sometimes fail to catch grammar violations.
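For example, the simple parser from our previous post rejects "a cat sees": the verb phrase never receives its object noun phrase, and parsing fails at the verb-phrase level.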

Challenges Faced During Parsing Expansion

Expanding CFG for all tenses was not without its challenges:

  • Ambiguity Management: English grammar contains ambiguities, especially in verb structures.
  • Lexical Limitations: The parser currently operates on a limited vocabulary set.
  • Scalability: While CFG is a strong rule-based approach, its expansion to accommodate a large corpus requires careful optimization.

Despite these limitations, our current implementation provides a solid foundation for further enhancements, particularly in sentence generation.

Next Steps: CFG-Based Sentence Generation

Having successfully completed the parsing phase, the next logical step is sentence generation using CFG. This involves:

  1. Building a Probabilistic CFG (PCFG): Assigning probabilities to grammar rules for varied sentence construction.
  2. Recursive Sentence Synthesis: Leveraging our existing parsing rules in reverse to generate grammatically correct sentences.
  3. Expanding Vocabulary: Adding a broader lexicon to enable diverse sentence generation.

The upcoming phase will allow our system not only to analyze language but also to construct meaningful and grammatically correct sentences. This step will be crucial for text synthesis, dialogue systems, and AI-generated content.

Conclusion

The completion of the parsing phase marks a significant milestone in this research-driven NLP project. With all 12 tenses supported, our CFG-based parser provides a structured approach to linguistic validation. As we transition into the sentence generation phase, our focus will be on implementing Probabilistic CFG, expanding vocabulary, and optimizing generation efficiency. The journey continues, and the next update will document our progress in making CFG-driven sentence generation a reality.

Stay tuned for further developments in this NLP research initiative!

Friday, February 7, 2025

Building a Recursive Descent Parser from Scratch in Python

 

Introduction

Natural Language Processing (NLP) is one of the most exciting fields in AI and Machine Learning. One of its core challenges is parsing, where we analyze the grammatical structure of a sentence. In this blog post, we will explore Recursive Descent Parsing (RDP) and implement a simple Context-Free Grammar (CFG) parser in Python.

 

What is Recursive Descent Parsing?

Recursive Descent Parsing is a top-down parsing technique where we attempt to match input sentences against a set of grammar rules by breaking them down recursively. It is widely used in compiler design and NLP applications such as sentence parsing and syntax checking.

Grammar:

We define a simple CFG with:

  • Articles (a, an, the)
  • Nouns (cat, mouse, cow)
  • Verbs (catches, sees, chases)

 

 

Grammar Rules:

 
            S  -> NP VP
            NP -> AR N
            VP -> V NP

 

Where:

  • S (Sentence) consists of a NP (Noun Phrase) followed by a VP (Verb Phrase).
  • NP consists of an AR (Article) and an N (Noun).
  • VP consists of a V (Verb) followed by another NP.

 

 

Python Implementation

Step 1: Define Word Categories

articles = ['a', 'an', 'the']
nouns = ['cat', 'mouse', 'cow']
verbs = ['catches', 'sees', 'chases']

Step 2: Tokenization

def tokenize(sentence: str):
    return sentence.lower().split()

Step 3: Parsing Functions

def match(remaining_tokens):
    # Consume the current token (the caller has already verified it) and return the rest.
    return remaining_tokens[1:]

def v(remaining_tokens: list):
    if remaining_tokens and remaining_tokens[0] in verbs:
        print("Word to parse --->", remaining_tokens[0])
        remaining_tokens = match(remaining_tokens)
        return True, remaining_tokens
    else:
        print("Parsing Error!!")
        return False, remaining_tokens

def ar(remaining_tokens: list):
    if remaining_tokens and remaining_tokens[0] in articles:
        print("Word to parse --->", remaining_tokens[0])
        remaining_tokens = match(remaining_tokens)
        return True, remaining_tokens
    else:
        print("Parsing Error!!")
        return False, remaining_tokens

def n(remaining_tokens: list):
    if remaining_tokens and remaining_tokens[0] in nouns:
        print("Word to parse --->", remaining_tokens[0])
        remaining_tokens = match(remaining_tokens)
        return True, remaining_tokens
    else:
        print("Parsing Error!!")
        return False, remaining_tokens

def np(remaining_tokens):
    # NP -> AR N
    if remaining_tokens:
        p1, remaining_tokens = ar(remaining_tokens)
        p2, remaining_tokens = n(remaining_tokens)
        print("Parsed noun part!!")
        return (p1 and p2), remaining_tokens
    return False, remaining_tokens  # Empty input: nothing to parse.

def vp(remaining_tokens: list):
    # VP -> V NP
    if remaining_tokens:
        p1, remaining_tokens = v(remaining_tokens)
        if not remaining_tokens:
            return False, remaining_tokens
        p2, remaining_tokens = np(remaining_tokens)
        print("Parsed verb part!!")
        return (p1 and p2), remaining_tokens
    return False, remaining_tokens  # Empty input: nothing to parse.

def s(tokens: list):
    if tokens:
        p1, remaining_tokens = np(tokens)
        if not p1:
            print("Parsing Failed at Noun Phrase")
            return False
        if not remaining_tokens:
            print("Parsing failed because of an incomplete sentence")
            return False
        p2, remaining_tokens = vp(remaining_tokens)
        if not p2:
            print("Parsing Failed at Verb Phrase!!")
            return False
        if remaining_tokens:
            print("Parsing Failed due to extra words!!!")
            return False
        print("Sentence Parsed Successfully!!")
        return True
    else:
        print("Please enter a valid sentence!!")
        return False

 

Step 4: Testing the Parser

sentence = "a cat sees a mouse"
tokens = tokenize(sentence)
s(tokens)

Output:

Word to parse ---> a
Word to parse ---> cat
Parsed noun part!!
Word to parse ---> sees
Word to parse ---> a
Word to parse ---> mouse
Parsed noun part!!
Parsed verb part!!
Sentence Parsed Successfully!!
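The parser also produces a diagnostic when a sentence violates the grammar. For example, a trailing word that the rules cannot account for is rejected:

sentence = "a cat sees a mouse quickly"
tokens = tokenize(sentence)
s(tokens)

Output:

Word to parse ---> a
Word to parse ---> cat
Parsed noun part!!
Word to parse ---> sees
Word to parse ---> a
Word to parse ---> mouse
Parsed noun part!!
Parsed verb part!!
Parsing Failed due to extra words!!!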

 

 

Expanding the Grammar!!!

Currently, our parser only supports simple subject-verb-object (SVO) sentences. We can expand it by:

  • Adding new tenses (past, future, continuous)
  • Handling negation & questions (Did the cat chase the mouse?)
  • Integrating a probability-based sentence generator

 

 

Real-World Applications

  • Grammar checking tools (e.g., Grammarly, LanguageTool)
  • Chatbots & Virtual Assistants (Understanding user input)
  • Compilers & Programming Language Interpreters
  • Voice Assistants (Siri, Alexa, Google Assistant)

 

Conclusion

Recursive Descent Parsing provides a simple yet effective way to analyze language. We implemented a basic CFG parser that correctly validates simple English sentences. The next step is enhancing it with more linguistic rules and applying it in real NLP tasks like machine translation and speech recognition.

🚀 Stay tuned for more upgrades as we expand this parser into an advanced NLP tool!
