Friday, February 7, 2025

Building a Recursive Descent Parser from Scratch in Python



 

Introduction

Natural Language Processing (NLP) is one of the most exciting fields in AI and Machine Learning. One of its core challenges is parsing, where we analyze the grammatical structure of a sentence. In this blog post, we will explore Recursive Descent Parsing (RDP) and implement a simple Context-Free Grammar (CFG) parser in Python.

 

What is Recursive Descent Parsing?

Recursive Descent Parsing is a top-down parsing technique in which each non-terminal of the grammar becomes a function that tries to match the input, calling the other rule functions (possibly recursively) to break the sentence down. It is widely used in compiler design and in NLP tasks such as sentence parsing and syntax checking.

Grammar:

We define a simple CFG with:

  • Articles (a, an, the)
  • Nouns (cat, mouse, cow)
  • Verbs (catches, sees, chases)


Grammar Rules:

 
            S  -> NP VP
            NP -> AR N
            VP -> V NP

 

Where:

  • S (Sentence) consists of an NP (Noun Phrase) followed by a VP (Verb Phrase).
  • NP consists of an AR (Article) and an N (Noun).
  • VP consists of a V (Verb) followed by another NP.
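
Before coding, it helps to see the grammar in action. Below is a leftmost derivation of "a cat sees a mouse": at each step the leftmost non-terminal is expanded, which is exactly the order in which a recursive descent parser visits the rules.

```python
# Leftmost derivation of "a cat sees a mouse" under the grammar above.
derivation = [
    "S",
    "NP VP",              # S  -> NP VP
    "AR N VP",            # NP -> AR N
    "a N VP",             # AR -> 'a'
    "a cat VP",           # N  -> 'cat'
    "a cat V NP",         # VP -> V NP
    "a cat sees NP",      # V  -> 'sees'
    "a cat sees AR N",    # NP -> AR N
    "a cat sees a N",     # AR -> 'a'
    "a cat sees a mouse", # N  -> 'mouse'
]
for step in derivation:
    print(step)
```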


Python Implementation

Step 1: Define Word Categories

articles = ['a', 'an', 'the']
nouns = ['cat', 'mouse', 'cow']
verbs = ['catches', 'sees', 'chases']

Step 2: Tokenization

def tokenize(sentence: str):
    return sentence.lower().split()
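
A quick sanity check of the tokenizer (the function is repeated here so the snippet runs on its own):

```python
def tokenize(sentence: str):
    # Lowercase the sentence and split on whitespace.
    return sentence.lower().split()

print(tokenize("A Cat sees a Mouse"))  # ['a', 'cat', 'sees', 'a', 'mouse']
```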

Step 3: Parsing Functions

def match(remaining_tokens):
    # Consume the matched token and return the rest of the input.
    return remaining_tokens[1:]

def v(remaining_tokens: list):
    # V: match a single verb token.
    if remaining_tokens and remaining_tokens[0] in verbs:
        print("Word to parse --->", remaining_tokens[0])
        remaining_tokens = match(remaining_tokens)
        return True, remaining_tokens
    else:
        print("Parsing Error!!")
        return False, remaining_tokens

def ar(remaining_tokens: list):
    # AR: match a single article token.
    if remaining_tokens and remaining_tokens[0] in articles:
        print("Word to parse --->", remaining_tokens[0])
        remaining_tokens = match(remaining_tokens)
        return True, remaining_tokens
    else:
        print("Parsing Error!!")
        return False, remaining_tokens

def n(remaining_tokens: list):
    # N: match a single noun token.
    if remaining_tokens and remaining_tokens[0] in nouns:
        print("Word to parse --->", remaining_tokens[0])
        remaining_tokens = match(remaining_tokens)
        return True, remaining_tokens
    else:
        print("Parsing Error!!")
        return False, remaining_tokens

def np(remaining_tokens):
    # NP -> AR N
    if not remaining_tokens:
        return False, remaining_tokens
    p1, remaining_tokens = ar(remaining_tokens)
    p2, remaining_tokens = n(remaining_tokens)
    if p1 and p2:
        print("Parsed noun part!!")
    return (p1 and p2), remaining_tokens

def vp(remaining_tokens: list):
    # VP -> V NP
    if not remaining_tokens:
        return False, remaining_tokens
    p1, remaining_tokens = v(remaining_tokens)
    if not remaining_tokens:
        return False, remaining_tokens
    p2, remaining_tokens = np(remaining_tokens)
    if p1 and p2:
        print("Parsed verb part!!")
    return (p1 and p2), remaining_tokens

def s(tokens: list):
    if tokens:
        p1, remaining_tokens = np(tokens)
        if not p1:
            print("Parsing Failed at Noun Phrase")
            return False
        if not remaining_tokens:
            print("Parsing failed because of an incomplete sentence")
            return False
        p2, remaining_tokens = vp(remaining_tokens)
        if not p2:
            print("Parsing Failed at Verb Phrase!!")
            return False
        if remaining_tokens:
            print("Parsing Failed due to extra words!!!")
            return False
        print("Sentence Parsed Successfully!!")
        return True
    else:
        print("Please enter a valid sentence!!")
        return False

 

Step 4: Testing the Parser

sentence = "a cat sees a mouse"
tokens = tokenize(sentence)
s(tokens)

Output:

Word to parse ---> a
Word to parse ---> cat
Parsed noun part!!
Word to parse ---> sees
Word to parse ---> a
Word to parse ---> mouse
Parsed noun part!!
Parsed verb part!!
Sentence Parsed Successfully!!
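
It is just as instructive to watch the parser reject a sentence. The snippet below is a condensed, behaviorally equivalent copy of the parser, rewritten so it runs on its own: the three terminal functions are folded into one helper, and the diagnostic prints are dropped in favor of a plain True/False result.

```python
articles = ['a', 'an', 'the']
nouns = ['cat', 'mouse', 'cow']
verbs = ['catches', 'sees', 'chases']

def terminal(words):
    # Build a parsing function that matches one token from a word class.
    def parse(tokens):
        if tokens and tokens[0] in words:
            return True, tokens[1:]
        return False, tokens
    return parse

ar, n, v = terminal(articles), terminal(nouns), terminal(verbs)

def np(tokens):
    # NP -> AR N
    ok1, tokens = ar(tokens)
    ok2, tokens = n(tokens)
    return ok1 and ok2, tokens

def vp(tokens):
    # VP -> V NP
    ok1, tokens = v(tokens)
    ok2, tokens = np(tokens)
    return ok1 and ok2, tokens

def s(tokens):
    # S -> NP VP, with no tokens left over.
    ok1, tokens = np(tokens)
    ok2, tokens = vp(tokens)
    return ok1 and ok2 and not tokens

print(s("the cow chases the cat".split()))  # True
print(s("the cow chases".split()))          # False: VP is missing its object NP
print(s("cow chases the cat".split()))      # False: NP needs an article first
```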


Expanding the Grammar!!!

Currently, our parser only supports simple subject-verb-object (SVO) sentences. We can expand it by:

  • Adding new tenses (past, future, continuous)
  • Handling negation & questions (Did the cat chase the mouse?)
  • Integrating a probability-based sentence generator
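
As a taste of what the question-handling extension might look like, here is a minimal sketch for yes/no questions of the form "did the cat chase the mouse". The auxiliary list (did, does, will) and the base-form verb list (catch, see, chase) are assumptions added for this sketch; the original grammar only has third-person verb forms.

```python
articles = ['a', 'an', 'the']
nouns = ['cat', 'mouse', 'cow']
base_verbs = ['catch', 'see', 'chase']  # hypothetical addition for this sketch
aux = ['did', 'does', 'will']           # hypothetical addition for this sketch

def eat(tokens, words):
    # Consume one token if it belongs to the given word class.
    if tokens and tokens[0] in words:
        return True, tokens[1:]
    return False, tokens

def np(tokens):
    # NP -> AR N
    ok1, tokens = eat(tokens, articles)
    ok2, tokens = eat(tokens, nouns)
    return ok1 and ok2, tokens

def question(tokens):
    # Q -> AUX NP V_BASE NP, with no tokens left over.
    ok1, tokens = eat(tokens, aux)
    ok2, tokens = np(tokens)
    ok3, tokens = eat(tokens, base_verbs)
    ok4, tokens = np(tokens)
    return ok1 and ok2 and ok3 and ok4 and not tokens

print(question("did the cat chase the mouse".split()))  # True
```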


Real-World Applications

  • Grammar checking tools (e.g., Grammarly, LanguageTool)
  • Chatbots & Virtual Assistants (Understanding user input)
  • Compilers & Programming Language Interpreters
  • Voice Assistants (Siri, Alexa, Google Assistant)

 

Conclusion

Recursive Descent Parsing provides a simple yet effective way to analyze language. We implemented a basic CFG parser that correctly validates simple English sentences. The next step is enhancing it with more linguistic rules and applying it in real NLP tasks like machine translation and speech recognition.

🚀 Stay tuned for more upgrades as we expand this parser into an advanced NLP tool!
