Extracting tri-literal root consonants from declined Arabic words
Author: mgatto (course project)
| Code Repository URL | https://github.com/uazhlt-ms-program/ling-582-fall-2025-course-project-code-arabic-tri-literal-root-extraction |
|---|---|
| Demo URL (optional) | N/A |
| Team name | Michael Gatto |
Project description
Arabic is a Semitic language within the Afro-Asiatic family. The distinguishing feature of Semitic languages is their templated morphology (Ryding 2005). In a templated morphological system, parts of speech are constructed by applying known, reproducible templates to a short sequence of two, three, or four consonantal roots. The vast majority of these root sequences consist of three consonants, called "tri-literals"; a much smaller number are either two or four consonants in length. We'll represent these tri-literal roots in general as C-C-C.
These roots convey an abstract meaning made concrete by applying one of a finite number of templates. Structurally, the templates consist of some combination of internal vowel patterns supplemented by a very limited set of prefixes, suffixes and infixes consisting of the phonemes [m] (prefix only), [n] (infix and suffix) or [t] (prefix, infix and suffix) and [ist] (prefix). For example, agent nouns usually take the pattern of C-aa-C-e-C. Thus, for the oft-cited root K-T-B which conveys the idea of writing, a K-aa-T-e-B is a person who writes, i.e. a writer: "kaateb".
Arabic grammarians have traditionally organized specific mixtures of affixes into 10 widely used forms to encode specific semantics. Each form contains set templates for declining verbs, forming agent nouns, pluralizing nouns, and forming perfect and imperfect participles. We'll consider only Form I words in this experiment, since they are the simplest. Form I has no infixes and is canonically represented by the 3rd person masculine singular perfect verb with the template C-a-C-a-C-a. For example, K-a-T-a-B-a (kataba: he wrote).
Computationally finding the roots is a type of feature extraction and is an important problem in Arabic NLP. Finding the roots of a word with high accuracy significantly affects foundational tasks in NLP such as POS tagging and various classification tasks. In turn, these foundational tasks can cause user-level applications such as spell-checking and language generation to fail laughably or succeed brilliantly (though often unnoticed, since when things "just work", they tend to be less noticeable).
For example, a very common derivational Form I prefix is ma-, signalling the start of a participle. There are, however, many roots which also begin with the same sequence, ma-. Thus, simply regex-ing it away is likely to produce numerous mistakes and a disappointingly low F1 score.
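A tiny sketch of why the naive regex approach fails, using transliterated examples of my own choosing (not from the project's dataset):

```python
import re

def naive_strip_ma(word: str) -> str:
    """Strip a leading 'ma' if present -- deliberately too aggressive."""
    return re.sub(r"^ma", "", word)

# maktab ("office") really is the ma- prefix on the root k-t-b,
# so stripping it is helpful here:
print(naive_strip_ma("maktab"))   # -> "ktab"

# madda ("to extend", root m-d-d) merely *begins* with the letters ma-;
# stripping destroys the first root consonant:
print(naive_strip_ma("madda"))    # -> "dda", the root's m is lost
```

The second case is exactly the kind of error that drags F1 down: the surface string is indistinguishable from a prefixed form without deeper analysis.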
Approach
I will use a character-level RNN, in the form of an LSTM, implemented in Python with PyTorch. I frame this as a multi-class classification problem (Al-Serhan & Ayesh 2006). As such, the RNN will output a vector classifying each character sequence as exactly one of:
- prefix,
- suffix,
- infix (recall that Form I has no infixes),
- root.
This output vector will then be fed into an MLP that selects, via a softmax, the probability of each letter being a root consonant or not.
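The originally planned architecture can be sketched as follows. This is a minimal illustration only; the class name, dimensions, and vocabulary size are my own placeholders, not the repository's actual code:

```python
import torch
import torch.nn as nn

CLASSES = ["prefix", "suffix", "infix", "root"]  # per-character labels

class CharTagger(nn.Module):
    """Character-level BiLSTM that tags each letter of a word as
    prefix / suffix / infix / root, then scores the tags with an MLP.
    Illustrative sketch of the stated approach, not the final model."""
    def __init__(self, vocab_size: int, emb_dim: int = 32, hidden: int = 64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, len(CLASSES)),
        )

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(self.emb(char_ids))
        return self.mlp(out)          # (batch, seq_len, 4) raw logits

model = CharTagger(vocab_size=40)                 # toy vocab size
logits = model(torch.randint(0, 40, (1, 6)))      # one 6-letter word
probs = torch.softmax(logits, dim=-1)             # the selecting function
root_mask = probs.argmax(dim=-1) == CLASSES.index("root")
```

The letters flagged by `root_mask` would then be read out, in order, as the candidate root consonants.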
Goal
Given a list of 100 Arabic test words of Form I derived from known, tri-literal consonantal roots, the Root Extractor successfully identifies at least 90 of them.
Outline
- Examine prior art and detail flaws with non-NN approaches.
- Experiment with al-Mus'haf dataset in a Jupyter Notebook.
- Experiment with CAMeL Tools for Arabic pre-processing.
- Determine, per Yuval, the "core linguistic features that are relevant for predicting the output classes".
- Code RNN with PyTorch.
- Train model (possibly on HPC if the training set exceeds 100,000 examples?).
- Evaluate model.
- Weep and tune.
- Re-evaluate model and be happy.
- Package Python code into a Docker container (use uv and pyproject.toml, not pip and requirements.txt?).
Schedule
By these dates, accomplish:
- 11/16/2025 - Selected training set.
- 11/23/2025 - Have working RNN.
- 11/30/2025 - Tuned RNN and decoded output vector with MLP and Softmax to present words' root consonants.
- 12/05/2025 - Polished codebase and packaged into Docker container.
Summary of individual contributions
| Team member | Role/contributions |
|---|---|
| Michael Gatto | Coder, Tester, Writer |
Results
Most stemmers for Arabic are largely rule-based, especially state-of-the-art ones, which approach an F1 of 98%. There is, however, a line of research using neural-network approaches, which I wanted to pursue for its novelty. A remarkably early attempt at using deep learning is Al-Serhan & Ayesh (2006), who reported an accuracy rate of 84%.
A more recent deep-learning paper is Ziadi, Cheikh & Jemni (2017), who reported a CNN model which "achieved a 97% classification rate in root word recognition", while their "preliminary experiments with the LSTM model gave a score equal to 87.2%". These approaches generally used classification techniques, often paired with factoring out affixes. This initially guided my first attempts under my stated approach, which performed quite poorly.
I significantly changed my approach at Dr. Hahn-Powell's suggestion. Instead of classifying letters as belonging to the root class or not, I switched to simply generating the three root characters directly. This yielded extraordinary results:
| Metric | Value |
|---|---|
| Accuracy | 0.9924 |
| F1 Score (weighted) | 0.9924 |
| Precision (weighted) | 0.9924 |
| Recall (weighted) | 0.9924 |
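The revised generate-three-characters approach can be sketched roughly as an encoder over the word plus a decoder that always emits exactly three characters. This is my own assumed shape for such a model (names and sizes are illustrative); see the repository README for the real architecture:

```python
import torch
import torch.nn as nn

class RootGenerator(nn.Module):
    """Encode the word with an LSTM, then decode exactly three root
    characters via greedy argmax. Illustrative sketch only."""
    def __init__(self, vocab_size: int, emb_dim: int = 32, hidden: int = 64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.decoder_cell = nn.LSTMCell(emb_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, word_ids: torch.Tensor) -> torch.Tensor:
        _, (h, c) = self.encoder(self.emb(word_ids))
        h, c = h.squeeze(0), c.squeeze(0)
        prev = torch.zeros(word_ids.size(0), dtype=torch.long)  # BOS id 0
        steps = []
        for _ in range(3):                    # always three root letters
            h, c = self.decoder_cell(self.emb(prev), (h, c))
            logits = self.out(h)
            prev = logits.argmax(dim=-1)      # greedy choice feeds next step
            steps.append(logits)
        return torch.stack(steps, dim=1)      # (batch, 3, vocab)

gen = RootGenerator(vocab_size=40)
root_logits = gen(torch.randint(0, 40, (2, 7)))  # two 7-letter words
```

Fixing the output length to three sidesteps the per-letter classification problem entirely: the model never has to decide which surface letters are affixes.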
This mirrors Kaddoura et al.'s baseline of 98%: "The results demonstrate that ANN-based root extraction achieved the highest F1-score of 98%, while other state-of-the-art stemmers yield lower scores".
This modest experiment initially processed only non-weak, non-geminated triliterals for simplicity, meaning each of the three letters was distinct and none was "y", "w", or the glottal stop. Encouraged by this unexpectedly strong performance, I first re-admitted the geminated roots. Geminated roots repeat the second consonant, for example: sh-d-d = to be severe. This lowered F1 by only 1%, which I found still quite acceptable.
In the final stage, I also allowed weak triliterals into the dataset. A weak root contains "y", "w", or "`" (glottal stop) as one of its root consonants. Weak consonants can make up as many as two of the three root positions, and they unstably drop out during word formation when templates are applied to the root. For example, w-h-b = to bestow, becomes the 3rd person masculine singular verb ya-hib (*ya-whib). Words derived from weak roots are quite frequent in Arabic, so excluding them gave a less realistic result. Including them caused a noticeable, but expected, decline in F1:
| Metric | Value |
|---|---|
| Root-level Exact Match Accuracy | 0.8669 (736/849) |
| Character-level Accuracy | 0.9399 |
| F1 Score (weighted) | 0.9392 |
| Precision (weighted) | 0.9392 |
| Recall (weighted) | 0.9399 |
Character-level accuracy credits the model for choosing a correct character even when that character lands in a different position than its actual place in the root sequence.
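One plausible reading of this metric, sketched below, is position-insensitive multiset overlap between predicted and gold roots; the repository may compute it slightly differently:

```python
from collections import Counter

def char_level_accuracy(pred_roots, gold_roots):
    """Fraction of gold root characters recovered anywhere in the
    prediction, ignoring position (illustrative implementation)."""
    hit, total = 0, 0
    for pred, gold in zip(pred_roots, gold_roots):
        overlap = Counter(pred) & Counter(gold)   # multiset intersection
        hit += sum(overlap.values())
        total += len(gold)
    return hit / total

# Exact-match accuracy would score 1/2 on this toy pair of predictions,
# but character-level accuracy scores 5/6: "dir" still shares d and r
# with the gold root "dwr".
print(char_level_accuracy(["dwr", "dir"], ["dwr", "dwr"]))
```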
For more about the model's architecture and training, please see the README file in the code repository.
Detailed training progressions are available for four late runs in the results directory of the code repository.
I did not need to use any HPC facility; the MPS backend on my MacBook proved sufficient. Testing in my Docker container using only PyTorch's "cpu" device yielded somewhat slower, but identical, results.
Error analysis
Geminated final literals still proved mildly problematic. Some errors were:
| Status | Word | Actual | Predicted |
|---|---|---|---|
| ✗ | أضلوا | ضلل | غلل |
| ✗ | خلة | خلل | خلد |
| ✗ | مدرارا | درر | قرر |
| ✗ | أعمام | عمم | نعم |
Curiously, the model would correctly predict the geminated 2nd and 3rd consonants but somehow get the 1st consonant completely wrong. This was the opposite of my expectations.
But it did get some correct:
| Status | Word | Actual | Predicted |
|---|---|---|---|
| ✓ | ملة | ملل | ملل |
| ✓ | يضلل | ضلل | ضلل |
Weak roots are indeed the most error prone. In run #20, 9 of 11 incorrect roots were weak:
| Status | Word | Actual | Predicted |
|---|---|---|---|
| ✗ | غيث | غيث | غوث |
| ✗ | شقي | شقو | شقق |
| ✗ | مليم | لوم | ملم |
| ✗ | باغ | بغي | بغغ |
| ✗ | تديرون | دور | دير |
| ✗ | مرء | مرء | مري |
| ✗ | فائزون | فوز | فيز |
| ✗ | تصلية | صلي | صلو |
| ✗ | راسيات | رسو | رسي |
For d-w-r, f-w-z, and r-s-w, the model correctly understood that these were weak roots but consistently generated a "y" each time instead of the correct "w". This may indicate some kind of issue with the corpus's sample distribution. Note that no feature engineering was done to hint at whether a word was based on a weak root or not.
More curiously, the model sometimes generates geminated roots when the actual root is in the weak class of triliterals.
Reproducibility
Please see the README file in the code repository.
Future improvements
Phonemic Constraints
There are phonemic restrictions on which letters can appear together in sequence within a root. This phenomenon is well known in the literature on Arabic root co-occurrence restrictions. Constraining proposed root combinations based on the immediately preceding generated root character could prove useful, though I don't currently understand how to encode that as a feature for deep learning.
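Rather than a feature, one way to exploit such restrictions is as a hard constraint at decode time: mask the logits of any consonant that may not follow the one just generated. The disallowed table below is a hypothetical toy (homorganic labial pairs), not a real inventory of Arabic co-occurrence restrictions:

```python
import math

# Hypothetical restriction table for illustration only.
DISALLOWED = {("b", "m"), ("b", "f")}

def mask_logits(logits: dict, prev_char: str) -> dict:
    """Set the score of any phonemically disallowed successor to -inf,
    so it can never win the argmax at this decoding step."""
    return {
        ch: (-math.inf if (prev_char, ch) in DISALLOWED else score)
        for ch, score in logits.items()
    }

candidates = {"m": 2.1, "t": 1.3, "k": 0.7}
masked = mask_logits(candidates, prev_char="b")
best = max(masked, key=masked.get)   # "t": "m" was ruled out despite its score
```

Because the mask acts on the output distribution rather than the input, no new training feature is needed; the same idea ports directly to tensor logits in PyTorch.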
Bigger Corpus
I would like to use a much larger corpus. One candidate is the very large Oxford Arabic Corpus, which also contains the word-root pairs necessary for this supervised learning experiment.
There's a danger that some roots are undersampled. The DataSet class should use more sophisticated techniques to ensure a more even distribution across roots and words. For example, [r, H, m] is likely to be over-represented in this theological dataset through various appellations and verbal forms. A Git branch in the repository contains an implementation, but it remains untested.
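One standard technique for this, sketched here as a guess at what such a DataSet could do (this is not the untested branch's actual code), is inverse-frequency weighted sampling over (word, root) pairs:

```python
import random
from collections import Counter

def balanced_sample(pairs, k, seed=0):
    """Draw k (word, root) pairs, weighting each pair by the inverse
    frequency of its root, so frequent roots like r-H-m dominate less."""
    rng = random.Random(seed)
    root_counts = Counter(root for _, root in pairs)
    weights = [1.0 / root_counts[root] for _, root in pairs]
    return rng.choices(pairs, weights=weights, k=k)

# Toy data: rHm appears in three surface words, ktb in only one.
data = [("rahma", "rHm"), ("rahiim", "rHm"),
        ("rahmaan", "rHm"), ("kataba", "ktb")]
sample = balanced_sample(data, k=1000)
# Each rHm pair gets weight 1/3 and the ktb pair weight 1, so the two
# roots are drawn roughly equally often despite the 3:1 imbalance.
```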
Quadriliterals
4-letter roots ought to be nearly as easy. Quadriliterals are typically not weak, so they would not pose the same problems. However, since this model is written to generate exactly three characters, it may perform oddly when expected to output four. This remains an active and fertile area for expansion.
One area which I won't consider expanding into is biliteral roots, or roots with only two letters which are very rare. Biliterals are usually particles fossilized from Arabic's earliest stages within Afro-Asiatic in the Neolithic.
Weak Roots
The model proved less capable with so-called weak/defective roots. It is unclear at this point how to help the model better select the correct, but often absent, "y" or "w". More training, and correcting class imbalances specifically in labeled roots containing weak consonants, could prove useful.
Citations
- Ryding, K. (2005). A Reference Grammar of Modern Standard Arabic. Cambridge University Press.