Tajik to English Machine Translation with mBART
Author: josefgarcia
| Code Repository URL | https://github.com/uazhlt-ms-program/ling-582-fall-2025-course-project-code-multilinguists |
|---|---|
| Demo URL (optional) | |
| Team name | Multilinguists |
Project description
For my project, I'm interested in exploring machine translation from Tajik to English. I'd like to measure the performance of the transformer-based model mBART. Knowledge exists not only in English, but also in the crevices of all languages. Tajik is a low-resource language with relatively limited datasets and research, which makes it a compelling test case. My goal is to evaluate how effectively mBART, an open-source multilingual transformer from Meta (Facebook), can translate between Tajik and English despite the data scarcity.
Plan of work
Week 1 (Nov. 17-21) - Data Collection and Preprocessing
Week 2 (Nov. 24-28) - Model Implementation and Training (mBART Transformer)
Week 3 (Dec. 1-5) - Reproducibility, Evaluation (BLEU scores), Analysis, and Report
Final Days (Dec. 6-7) - Polishing Report and Final Submission
Data
Tajik to English translation dataset
This dataset consists of questions and answers provided in both Tajik and English. It is derived from the GSM8K dataset, which contains grade school math problems and their solutions.
Training Data Shape: (7473, 4)
Testing Data Shape: (1319, 4)
Features: ['question', 'question_tj', 'answer', 'answer_tj']
Source: "https://huggingface.co/datasets/muhtasham/gsm8k-socratic-tajik/embed/viewer/default/train"
Tajik Source Text Lengths:
- Train - Avg Length: 73.68, Max Length: 16379
- Validation - Avg Length: 77.95, Max Length: 8194
- Test - Avg Length: 73.29, Max Length: 8189

English Target Text Lengths:
- Train - Avg Length: 64.91, Max Length: 310
- Validation - Avg Length: 64.21, Max Length: 265
- Test - Avg Length: 66.43, Max Length: 257
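As a sketch, the dataset can be loaded from the Hub and the shape statistics reproduced as follows. The whitespace-token length computation and the use of a "test" split are assumptions on my part; the validation split was carved out separately.

```python
from datasets import load_dataset

# Load the parallel Tajik/English GSM8K dataset from the Hugging Face Hub.
ds = load_dataset("muhtasham/gsm8k-socratic-tajik")
train_df = ds["train"].to_pandas()
test_df = ds["test"].to_pandas()

print("Training Data Shape:", train_df.shape)   # (7473, 4)
print("Testing Data Shape:", test_df.shape)     # (1319, 4)
print("Features:", list(train_df.columns))

# Whitespace-token length statistics for the Tajik source text.
lengths = train_df["question_tj"].str.split().str.len()
print(f"Train - Avg Length: {lengths.mean():.2f}, Max Length: {lengths.max()}")
```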
Summary of individual contributions
| Team member | Role/contributions |
|---|---|
| Joey Garcia | Everything |
Transformer Summary
To evaluate translation quality from Tajik to English, I fine-tuned the multilingual sequence-to-sequence transformer mBART-50 (Many-to-Many MMT). The dataset was wrapped in a custom MBARTDataset class that tokenizes source (Tajik) and target (English) text pairs using MBart50TokenizerFast with a maximum sequence length of 128 tokens. Because mBART does not include a native Tajik language code, I used 'ru_RU' as the closest Cyrillic proxy for the source language token and 'en_XX' for English.
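A minimal sketch of such a dataset class, assuming the parallel texts arrive as lists of strings (the actual implementation in the repository may differ):

```python
import torch
from torch.utils.data import Dataset
from transformers import MBart50TokenizerFast

class MBARTDataset(Dataset):
    """Tokenizes parallel (Tajik, English) text pairs for mBART-50."""

    def __init__(self, src_texts, tgt_texts, tokenizer, max_length=128):
        self.src_texts, self.tgt_texts = src_texts, tgt_texts
        self.tokenizer, self.max_length = tokenizer, max_length

    def __len__(self):
        return len(self.src_texts)

    def __getitem__(self, idx):
        enc = self.tokenizer(
            self.src_texts[idx],
            text_target=self.tgt_texts[idx],   # tokenized as English labels
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        # input_ids, attention_mask, labels as fixed-length tensors
        return {k: v.squeeze(0) for k, v in enc.items()}

# 'ru_RU' is the Cyrillic proxy for Tajik; mBART-50 has no native Tajik code.
tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt",
    src_lang="ru_RU", tgt_lang="en_XX",
)
```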
Training was performed with (see the sketch after this list):
- mixed precision (AMP)
- the AdamW optimizer
- a linear learning-rate schedule with warmup
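A minimal sketch of this training setup, assuming a DataLoader named train_loader over the MBARTDataset above; the learning rate and warmup fraction are illustrative, not taken from the repository:

```python
import torch
from transformers import MBartForConditionalGeneration, get_linear_schedule_with_warmup

device = "cuda"
model = MBartForConditionalGeneration.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt"
).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)   # lr is illustrative
num_training_steps = 94 * 10   # 94 batches/epoch x 10 epochs (from the logs)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),          # warmup fraction assumed
    num_training_steps=num_training_steps,
)
scaler = torch.amp.GradScaler(device)   # mixed-precision loss scaling

model.train()
for batch in train_loader:
    optimizer.zero_grad()
    with torch.amp.autocast(device):
        loss = model(**{k: v.to(device) for k, v in batch.items()}).loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
```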
The training loop evaluated performance (see the sketch after this list):
- at the end of each epoch, using validation BLEU
- computed by generating translations over the validation set with beam search
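A sketch of that evaluation step, assuming a helper named validation_bleu and a beam width of 5 (both illustrative, not taken from the repository):

```python
import sacrebleu
import torch

@torch.no_grad()
def validation_bleu(model, tokenizer, val_loader, references):
    """Generate English translations with beam search and score corpus BLEU."""
    model.eval()
    hypotheses = []
    for batch in val_loader:
        generated = model.generate(
            input_ids=batch["input_ids"].to("cuda"),
            attention_mask=batch["attention_mask"].to("cuda"),
            # Force English as the first generated token.
            forced_bos_token_id=tokenizer.convert_tokens_to_ids("en_XX"),
            num_beams=5,
            max_length=128,
        )
        hypotheses.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return sacrebleu.corpus_bleu(hypotheses, [references]).score
```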
To prevent overfitting, I implemented early stopping with a patience value of 3: training halts when the BLEU score fails to improve for three consecutive epochs. The best model (based on highest validation BLEU) was saved automatically. However, because mBART-50 Many-to-Many MMT is a very large model with hundreds of millions of parameters, meaningful improvements often occur slowly. For this project I trained for only 10 epochs, so early stopping was not triggered; it would become more valuable in longer training schedules (> 25 epochs) where the risk of overfitting is higher.
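A minimal sketch of that early-stopping loop, using the hypothetical helper train_one_epoch and the validation_bleu helper sketched above; the print messages mirror the training log below:

```python
best_bleu, patience, PATIENCE_LIMIT = -1.0, 0, 3

for epoch in range(10):
    train_one_epoch(model, train_loader)   # hypothetical helper
    bleu = validation_bleu(model, tokenizer, val_loader, val_references)
    if bleu > best_bleu:
        print(f"New best BLEU ({best_bleu:.2f} -> {bleu:.2f})")
        best_bleu, patience = bleu, 0
        model.save_pretrained("./data/model/best_bleu")   # keep the best checkpoint
    else:
        patience += 1
        print(f"No BLEU improvement. Patience = {patience}/{PATIENCE_LIMIT}")
        if patience >= PATIENCE_LIMIT:
            break   # halt training early
```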
Results
Obtaining a BLEU score above 60 for a low-resource language using a proxy Cyrillic tokenization (ru_RU) was surprisingly strong; the token overlap between Tajik and Russian worked better than expected.
Training Results
Across all 10 epochs there is consistent improvement in both training loss and validation BLEU. The training loss decreases sharply, and the steadily rising validation BLEU indicates that the model is learning Tajik-to-English translation effectively and generalizing well.
BLEU 62.82 by epoch 10
===== Epoch 1/10 ===== Training: 100% 94/94 [01:25<00:00, 1.33it/s, loss=1.02] Avg training loss: 3.7135 Validation BLEU = 48.10 New best BLEU (-1.00 -> 48.10)
===== Epoch 2/10 ===== Training: 100% 94/94 [01:25<00:00, 1.31it/s, loss=0.48] Avg training loss: 0.5710 Validation BLEU = 57.85 New best BLEU (48.10 -> 57.85)
===== Epoch 3/10 ===== Training: 100% 94/94 [01:25<00:00, 1.32it/s, loss=0.423] Avg training loss: 0.4268 Validation BLEU = 60.23 New best BLEU (57.85 -> 60.23)
===== Epoch 4/10 ===== Training: 100% 94/94 [01:25<00:00, 1.29it/s, loss=0.418] Avg training loss: 0.3721 Validation BLEU = 61.12 New best BLEU (60.23 -> 61.12)
===== Epoch 5/10 ===== Training: 100% 94/94 [01:25<00:00, 1.32it/s, loss=0.294] Avg training loss: 0.3343 Validation BLEU = 61.71 New best BLEU (61.12 -> 61.71)
===== Epoch 6/10 ===== Training: 100% 94/94 [01:25<00:00, 1.32it/s, loss=0.359] Avg training loss: 0.3062 Validation BLEU = 62.39 New best BLEU (61.71 -> 62.39)
===== Epoch 7/10 ===== Training: 100% 94/94 [01:25<00:00, 1.31it/s, loss=0.23] Avg training loss: 0.2848 Validation BLEU = 62.75 New best BLEU (62.39 -> 62.75)
===== Epoch 8/10 ===== Training: 100% 94/94 [01:25<00:00, 1.29it/s, loss=0.235] Avg training loss: 0.2685 Validation BLEU = 62.70 No BLEU improvement. Patience = 1/3
===== Epoch 9/10 ===== Training: 100% 94/94 [01:25<00:00, 1.30it/s, loss=0.22] Avg training loss: 0.2568 Validation BLEU = 62.82 New best BLEU (62.75 -> 62.82)
===== Epoch 10/10 ===== Training: 100% 94/94 [01:26<00:00, 1.31it/s, loss=0.255] Avg training loss: 0.2496 Validation BLEU = 62.82 No BLEU improvement. Patience = 1/3 Saved final model to ./data/model Best BLEU model saved to ./data/model/best_bleu
Tajik to English Translations (Test Results)
| | src_tajik | translation | reference_english |
|---|---|---|---|
| 0 | Духтари Ҷанет ҳар рӯз 16 тухм мегузорад. Вай ҳар субҳ се тухмро барои наҳорӣ мехӯрад ва ҳар рӯз бо чор тухм барои дӯстон muffins мепазад. Вай боқимондаро ҳар рӯз дар бозори деҳқонон барои 2 доллар барои ҳар тухми тозаи духтари мефурӯшад. Вай ҳар рӯз дар бозори деҳқонон чанд доллар ба даст меорад? | Janet's daughter puts in 16 eggs a day. She eats three eggs a morning for breakfast and bakes muffins for her friends with four eggs a day. She sells the rest every day at the farmer's market for $2 per freshly hatched daughter's egg. She spends $2 a day at the farmer's market selling the remaining eggs to her friends. How much money does she make each day at the farmer's market? | Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market? |
| 1 | Як рўйи 2 болт нахи кабуд ва нимашон нахи сафед мегирад. Ҳамагӣ чанд болт лозим аст? | A river takes 2 blue notes and half as many white notes. How many notes total does it need? | A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take? |
| 2 | Ҷош қарор медиҳад, ки хонаи худро флип кунад. Вай хонаи худро барои 80,000 доллар мехарад ва сипас 50,000 доллар барои таъмирҳо сарф мекунад. Ин арзиши хонаи онро 150% зиёд кард. Вай чанд фоида ба даст овард? | Josh decides to flip his apartment. He buys the apartment for $80,000 and then spends $50,000 on repairs. This increased the cost of the apartment by 150%. How much profit did he make? | Josh decides to try flipping a house. He buys a house for $80,000 and then puts in $50,000 in repairs. This increased the value of the house by 150%. How much profit did he make? |
| 3 | Ҷеймс қарор медиҳад, ки 3 спринтро 3 маротиба дар як ҳафта гузаронад. Вай дар ҳар спринт 60 метр давида меравад. Вай дар як ҳафта дар маҷмӯъ чанд метр давида меравад? | James decides to jog 3 jumps 3 times a week. He runs 60 meters on each jump. How many meters does he run in total per week? | James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week? |
| 4 | Ҳар рӯз, Вэнди ба ҳар як мурғи худ се пиёла хӯроки омехтаи мурғ, ки донаҳо, гусфандон ва сабзавот дорад, медиҳад, то онҳоро солим нигоҳ дорад. Вай хӯроки мурғҳоро дар се хӯроки алоҳида медиҳад. Дар субҳ, вай ба гӯшаи мурғонаш 15 пиёла хӯрок медиҳад. Дар пасзамина, вай ба мурғонаш 25 пиёла хӯроки дигар медиҳад. Агар шумораи мурғҳои Вэнди 20 мурғ бошад, вай дар хӯроки ниҳоии рӯз чанд пиёла хӯрок бояд ба мурғонаш диҳад? | Every day, Wendi gives each of her chickens three cups of chicken mixture containing doughnuts, gumballs, and vegetables to keep healthy. She feeds the chickens in three separate meals. In the morning, she gives her chicken herd 15 cups of food. In the afternoon, she gives her chicken herd 3 cups of food. How many days will it take Wendi to feed all of her chickens the chicken mixture? | Every day, Wendi feeds each of her chickens three cups of mixed chicken feed, containing seeds, mealworms and vegetables to help keep them healthy. She gives the chickens their feed in three separate meals. In the morning, she gives her flock of chickens 15 cups of feed. In the afternoon, she gives her chickens another 25 cups of feed. How many cups of feed does she need to give her chickens in the final meal of the day if the size of Wendi's flock is 20 chickens? |
Error analysis
For this analysis we'll be evaluating the 5 rows above.
Semantic errors are the biggest problem (Rows 0, 1, and 4). The model sometimes misunderstands subjects, objects, or context. An example is in Row 0, where the translation reads "Janet's daughter puts..." while the reference reads "Janet's ducks lay...".
There are also lexical errors: rare words and ambiguous nouns cause mistakes for the model. The mistake that jumps out to me is in Row 1, where the model produced "river" while the reference word is "robe". Next, in Row 3, the model produced "jump" where the intended meaning is "sprint".
There is also a numerical error: a quantitative detail is mistranslated in Row 4, where the translation has "3" but the reference has "25".
All in all, the remaining variations are minor and acceptable. For example, in Row 2 the model handles the numbers well and retains context. Although this is only a quick analysis of the translations, the results are enlightening and favorable considering that we ran only 10 epochs.
Reproducibility
For proper reproducibility, run the Jupyter notebook in Google Colab Pro (free education version) and use the "NVIDIA A100-SXM4-80GB" runtime option for optimal performance. With this GPU resource, the total run time averages about 2 hours when "running all".
The libraries (and their versions) used in the project can be found at the bottom of the "mbart.ipynb" file.
Library Versions in mbart.ipynb
protobuf : NOT INSTALLED
transformers : 4.57.3
accelerate : 1.12.0
argparse : 1.1
torch : 2.9.0+cu126
pandas : 2.2.2
datasets : 4.0.0
sacrebleu : 2.5.1
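A sketch of how such a version report can be produced (note that the pip package protobuf imports as google.protobuf, so a bare import reports it as not installed, consistent with the output above):

```python
import importlib

# Report the installed version of each dependency used in the notebook.
for lib in ["protobuf", "transformers", "accelerate", "argparse",
            "torch", "pandas", "datasets", "sacrebleu"]:
    try:
        module = importlib.import_module(lib)
        print(f"{lib:<12} : {getattr(module, '__version__', 'unknown')}")
    except ImportError:
        print(f"{lib:<12} : NOT INSTALLED")
```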
Alternatively, if you want to run the model locally (which I highly discourage), run the following commands:
conda create -n nlp_torch python=3.10
conda activate nlp_torch
pip install -r requirements.txt
conda deactivate  # use when finished
Future improvements
If I had more time and GPU resources, I would run the training loop for 50 epochs (increasing patience to 5) to find the limit of this model. Early stopping was implemented, but within only 10 epochs it never had the chance to identify the peak BLEU score. A longer training schedule would allow early stopping to reliably identify the true performance peak.
This project relied on a relatively small dataset with sentence patterns that tended to be repetitive. The model may have learned these structures efficiently without being exposed to broader linguistic variety. A natural next step would be to expand the training corpus by identifying larger and more diverse Tajik-English parallel datasets covering different topics, writing styles, and levels of complexity. A richer dataset would help the model generalize better, reduce overfitting to narrow question-answer patterns, and produce more robust translations across real-world use cases.