A sophisticated spell checker application that uses N-gram language models to detect and correct both non-word errors and context-aware real-word errors. Built with Python and PyQt5, it provides an interactive GUI with real-time spell checking and context-sensitive suggestions.
- Non-word Error Detection: Identifies misspelled words not present in the vocabulary (e.g., "teh" instead of "the")
- Real-word Error Detection: Detects contextually incorrect words using bigram language models (e.g., "I have to go their" should be "there")
- Red underline for non-word errors
- Blue underline for context-based errors
- Right-click context menu for correction suggestions
- Minimum Edit Distance (MED): Generates spelling suggestions up to an edit distance of 2
- N-gram Language Models: Uses unigram and bigram models for context-aware checking
- Laplace Smoothing: Handles zero-probability issues in language model calculations
- Python 3.x
- PyQt5
- NumPy
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd context-aware-spellchecker-using-N-gram-language-model
  ```

- Install required dependencies:

  ```bash
  pip install PyQt5 numpy
  ```

- Prepare the corpus data:
  - Create a `data` directory in the project root
  - Move `corpus.txt` to the `data` directory, or add your own text files

  ```bash
  mkdir data
  mv corpus.txt data/
  ```

- Run the application:

  ```bash
  python main.py
  ```

The application will open a text editor window where you can:
- Type or paste text
- Misspelled words will be underlined in red
- Contextually incorrect words will be underlined in blue
- Right-click on underlined words to see correction suggestions
- Click a suggestion to replace the word
```
.
├── main.py                 # Application entry point with PyQt5 GUI
├── spellcheckwrapper.py    # Wrapper class integrating all spell checking functionality
├── non_word_checking.py    # Non-word error detection and correction
├── real_word_checking.py   # Context-aware real-word error detection
├── spelltextedit.py        # Custom QTextEdit with spell checking support
├── highlighter.py          # Syntax highlighter for marking spelling errors
├── correction_action.py    # Custom QAction for correction menu items
└── corpus.txt              # Training corpus for language models
```
The system preprocesses text by:
- Converting to lowercase
- Removing punctuation and special characters
- Splitting into sentences and adding `<SOS>` (Start of Sentence) and `<EOS>` (End of Sentence) markers
- Creating token lists for model training (see the sketch after this list)
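As a rough sketch of this preprocessing, the `preprocess` helper below is illustrative (its regular expressions and sentence splitting are assumptions, not the project's exact code):

```python
import re

def preprocess(text):
    """Lowercase, strip punctuation, and wrap each sentence in <SOS>/<EOS> markers."""
    sentences = re.split(r"[.!?]+", text.lower())         # naive sentence split
    token_lists = []
    for sentence in sentences:
        words = re.findall(r"[a-z']+", sentence)          # drop punctuation/special characters
        if words:
            token_lists.append(["<SOS>"] + words + ["<EOS>"])
    return token_lists

# preprocess("I have two books. They are new!")
# -> [['<SOS>', 'i', 'have', 'two', 'books', '<EOS>'],
#     ['<SOS>', 'they', 'are', 'new', '<EOS>']]
```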
- Unigram Model: Tracks individual word frequencies
- Bigram Model: Tracks word pair frequencies for context analysis
- Models are trained on the corpus during initialization
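A minimal sketch of how those counts can be accumulated from the preprocessed token lists (the `train_ngrams` name and structure are illustrative, not the project's API):

```python
from collections import Counter

def train_ngrams(token_lists):
    """Count unigram and bigram frequencies over all tokenized sentences."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for tokens in token_lists:
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return unigram_counts, bigram_counts
```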
- Checks if a word exists in the vocabulary
- Generates suggestions using edit operations (insert, delete, replace, transpose)
- Ranks suggestions by Minimum Edit Distance
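For illustration, candidate generation with the four edit operations can be sketched in the style of Norvig's well-known spell corrector; the `edits1`/`suggestions` helpers below are hypothetical and assume `vocabulary` is a set of known words:

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one edit away: deletions, transpositions, replacements, insertions."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {L + R[1:] for L, R in splits if R}
    transposes = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
    replaces = {L + c + R[1:] for L, R in splits if R for c in ALPHABET}
    inserts = {L + c + R for L, R in splits for c in ALPHABET}
    return deletes | transposes | replaces | inserts

def suggestions(word, vocabulary):
    """In-vocabulary candidates, preferring edit distance 1 over edit distance 2."""
    one_away = edits1(word) & vocabulary
    if one_away:
        return one_away
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)} & vocabulary
```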
- Calculates bigram probability for word sequences
- Compares original word probability with similar word alternatives
- Suggests replacements if alternatives have higher probability in context
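One way to express that comparison, assuming a two-argument `bigram_prob(prev_word, word)` callable such as the smoothed estimate described below with its counts bound in (the helper name and candidate set are illustrative):

```python
def real_word_suggestion(prev_word, word, candidates, bigram_prob):
    """Return a candidate that beats the original word's in-context probability, if any."""
    best_word, best_p = word, bigram_prob(prev_word, word)
    for candidate in candidates:          # e.g. in-vocabulary words within edit distance 1
        p = bigram_prob(prev_word, candidate)
        if p > best_p:
            best_word, best_p = candidate, p
    return best_word if best_word != word else None
```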
Applies add-one smoothing to handle unseen word pairs:
P(word2 | word1) = (count(word1, word2) + 1) / (count(word1) + V + 1)
where V is the vocabulary size.
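The formula maps directly onto the unigram and bigram counts; a sketch with illustrative names:

```python
def bigram_prob(prev_word, word, unigram_counts, bigram_counts, vocab_size):
    """Add-one smoothed P(word | prev_word), following the formula above."""
    numerator = bigram_counts.get((prev_word, word), 0) + 1
    denominator = unigram_counts.get(prev_word, 0) + vocab_size + 1
    return numerator / denominator
```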
Supports four edit operations:
- Deletion: Remove a character
- Insertion: Add a character
- Replacement: Change a character
- Transposition: Swap adjacent characters
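A generic Damerau-Levenshtein implementation covering these four operations might look like the sketch below (a reference implementation, not necessarily the project's code):

```python
def edit_distance(a, b):
    """Minimum insertions, deletions, replacements, and adjacent transpositions to turn a into b."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # replacement (or match)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

# edit_distance("teh", "the") == 1  (one transposition)
```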
Calculates the sentence likelihood as a sum of log bigram probabilities:
log P(sentence) = Σ log P(word_i | word_{i-1})
A higher log-probability indicates a better fit with the surrounding context.
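As a sketch, this scoring can reuse the smoothed bigram probability from above (again assuming a two-argument `bigram_prob(prev, word)` callable, not the project's exact API):

```python
import math

def sentence_log_prob(tokens, bigram_prob):
    """Sum of log bigram probabilities over a <SOS> ... <EOS> token sequence."""
    return sum(math.log(bigram_prob(prev, word))
               for prev, word in zip(tokens, tokens[1:]))
```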
Input: "I have too books"
Detection:
- "too" is correctly spelled but contextually wrong
- The bigram model finds that "two" has a higher probability after "have"
- Word is underlined in blue
- Right-click shows: "two (Real-word Error)"
This project builds upon concepts from: