Building a Recursive Descent Parser from Scratch in Python
Introduction
Natural Language Processing (NLP) is one of the most exciting fields in AI and Machine Learning. One of its core challenges is parsing, where we analyze the grammatical structure of a sentence. In this blog post, we will explore Recursive Descent Parsing (RDP) and implement a simple Context-Free Grammar (CFG) parser in Python.
What is Recursive Descent Parsing?
Recursive Descent Parsing is a top-down parsing technique where we attempt to match input sentences against a set of grammar rules by breaking them down recursively. It is widely used in compiler design and NLP applications such as sentence parsing and syntax checking.
Grammar :
We define a simple CFG with:
- Articles (
a, an, the) - Nouns (
cat, mouse, cow) - Verbs (
catches, sees, chases)
Grammar Rules:
S -> NP VP
NP -> AR N
VP -> V NP
Where:
S(Sentence) consists of aNP(Noun Phrase) followed by aVP(Verb Phrase).NPconsists of anAR(Article) and anN(Noun).VPconsists of aV(Verb) followed by anotherNP.
Python Implementation
Step 1: Define Word Categories
articles = ['a', 'an', 'the']
nouns = ['cat', 'mouse', 'cow']
verbs = ['catches', 'sees', 'chases']
Step 2: Tokenization
def tokenize(sentence: str):
return sentence.lower().split()
Step 3: Parsing Functions
def match(remaining_tokens):
return remaining_tokens[1:]
def v(remaining_tokens: list):
if remaining_tokens and remaining_tokens[0] in verbs:
print("Word to parse --->", remaining_tokens[0])
remaining_tokens = match(remaining_tokens)
return True, remaining_tokens
else:
print("Parsing Error!!")
return False, remaining_tokens
def ar(remaining_tokens: list):
if remaining_tokens and remaining_tokens[0] in articles:
print("Word to parse --->", remaining_tokens[0])
remaining_tokens = match(remaining_tokens)
return True, remaining_tokens
else:
print("Parsing Error!!")
return False, remaining_tokens
def n(remaining_tokens: list):
if remaining_tokens and remaining_tokens[0] in nouns:
print("Word to parse --->", remaining_tokens[0])
remaining_tokens = match(remaining_tokens)
return True, remaining_tokens
else:
print("Parsing Error!!")
return False, remaining_tokens
def np(remaining_tokens):
if remaining_tokens:
p1, remaining_tokens = ar(remaining_tokens)
p2, remaining_tokens = n(remaining_tokens)
print("Parsed noun part!!")
return (p1 and p2), remaining_tokens
def vp(remaining_tokens: list):
if remaining_tokens:
p1, remaining_tokens = v(remaining_tokens)
if not remaining_tokens:
return False, remaining_tokens
p2, remaining_tokens = np(remaining_tokens)
print("Parsed verb part!!")
return (p1 and p2), remaining_tokens
def s(tokens: list):
if tokens:
p1, remaining_tokens = np(tokens)
if not p1:
print("Parsing Failed at Noun Phrase")
return False
if not remaining_tokens:
print("Parsing failed because of an incomplete sentence")
return False
p2, remaining_tokens = vp(remaining_tokens)
if not p2:
print("Parsing Failed at Verb Phrase!!")
return False
if remaining_tokens:
print("Parsing Failed due to extra words!!!")
return False
print("Sentence Parsed Successfully!!")
return True
else:
print("Please enter a valid sentence!!")
return False
Step 4: Testing the Parser
sentence = "a cat sees a mouse"
tokens = tokenize(sentence)
s(tokens)
Output:
Word to parse ---> a
Word to parse ---> cat
Parsed noun part!!
Word to parse ---> sees
Word to parse ---> a
Word to parse ---> mouse
Parsed noun part!!
Parsed verb part!!
Sentence Parsed Successfully!!
Expanding the Grammar!!!
Currently, our parser only supports simple subject-verb-object (SVO) sentences. We can expand it by:
- Adding new tenses (
past, future, continuous) - Handling negation & questions (
Did the cat chase the mouse?) - Integrating a probability-based sentence generator
Real-World Applications
- Grammar checking tools (e.g., Grammarly, LanguageTool)
- Chatbots & Virtual Assistants (Understanding user input)
- Compilers & Programming Language Interpreters
- Voice Assistants (Siri, Alexa, Google Assistant)
Conclusion
Recursive Descent Parsing provides a simple yet effective way to analyze language. We implemented a basic CFG parser that correctly validates simple English sentences. The next step is enhancing it with more linguistic rules and applying it in real NLP tasks like machine translation and speech recognition.
🚀 Stay tuned for more upgrades as we expand this parser into an advanced NLP tool!
No comments:
Post a Comment